Date: Thu, 1 Aug 2002 14:18:04 -0500 (CDT)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: Joe R. Jah <jjah@cloud.ccsf.cc.ca.us>
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] BSD/OS-4.3, htdig-3.1.6 core dumped

According to Joe R. Jah:
> This morning after a long period of smooth operation htdig dumped core.
> Here is the relevant rundig -v line:
> 
> 7127:7129:3:http://www.ccsf.edu/Offices/Office_of_Instruction/SU2001NC.xls: Segmentation fault - core dumped
> 
> Here is a gdb back trace:
> 
> $ gdb htdig htdig.core
> GNU gdb 
> Copyright 1998 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-unknown-bsdi4.3"...
> Core was generated by `htdig'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libz.so...done.
> Reading symbols from /usr/lib/libstdc++.so.1...done.
> Reading symbols from /shlib/libgcc.so.1...done.
> Reading symbols from /shlib/libc.so.2...done.
> Reading symbols from /shlib/ld-bsdi.so...done.
> #0  0x8107744 in __rethrow () at HtRegex.h:59
> 59          int         match(const String& str, int nullmatch, int nullstr) { return match(str.get(), nullmatch, nullstr); }
> (gdb) bt
> #0  0x8107744 in __rethrow () at HtRegex.h:59
> #1  0x810780e in __rethrow () at HtRegex.h:59
> #2  0x81078fa in __rethrow () at HtRegex.h:59
> #3  0x8107dcc in __frame_state_for () at HtRegex.h:59
> #4  0x8107385 in __throw () at HtRegex.h:59
> #5  0x8106167 in __builtin_new () at HtRegex.h:59
> #6  0x8106032 in __builtin_vec_new () at HtRegex.h:59
> #7  0x806daf4 in String::allocate_space (this=0x804786c, len=16777216) at String.cc:584
> #8  0x806db83 in String::reallocate_space (this=0x804786c, len=16777216) at String.cc:614
> #9  0x806d218 in String::append (this=0x804786c, 
>     s=0x8046fd0 "nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</TD>\n<TD>&nbsp;</T"..., slen=2048) at String.cc:157
> #10 0x8063159 in ExternalParser::parse (this=0x8605440, retriever=@0x8047ac0, base=@0x81a8780) at ExternalParser.cc:539
> #11 0x805af84 in Retriever::RetrievedDocument (this=0x8047ac0, doc=@0x81a8600, ref=0x8859000) at Retriever.cc:577
> #12 0x805ab09 in Retriever::parse_url (this=0x8047ac0, urlRef=@0x8a729c0) at Retriever.cc:473
> #13 0x805a3a5 in Retriever::Start (this=0x8047ac0) at Retriever.cc:292
> #14 0x8060a96 in main (ac=7, av=0x8047cbc) at htdig.cc:338
> #15 0x8053a3e in __start ()
> 
> I appreciate any insight.

It sure looks to me like you ran out of virtual memory while htdig was
reading the HTML output of your .xls to HTML external converter: frame #7
shows String::allocate_space being asked for 16777216 bytes (16 MB), and
the crash comes in the exception path under __builtin_new (frames #0 to #6).
It occurs to me that ExternalParser::parse doesn't check max_doc_size
when reading the external converter's output, which is not such a smart
thing, as a converter can easily spit out more than htdig can handle.
I sense another patch coming on...  Ah, yes, here it comes now...

--- htdig/ExternalParser.cc.orig	Wed Jan  9 16:23:25 2002
+++ htdig/ExternalParser.cc	Thu Aug  1 14:11:07 2002
@@ -535,8 +535,15 @@ ExternalParser::parse(Retriever &retriev
 	{
 	    char	buffer[2048];
 	    int		length;
-	    while ((length = fread(buffer, 1, sizeof(buffer), input)) > 0)
+	    int		nbytes = config.Value("max_doc_size");
+	    while (nbytes > 0 &&
+			(length = fread(buffer, 1, sizeof(buffer), input)) > 0)
+	    {
+		nbytes -= length;
+		if (nbytes < 0)
+		    length += nbytes;
 		newcontent.append(buffer, length);
+	    }
 	}
     }
     fclose(input);
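
For anyone who wants to try the idea outside of htdig, here is a minimal
standalone sketch of the same capped read loop.  It is only an illustration,
not the htdig code: read_capped and its max_bytes parameter are hypothetical
names standing in for the loop in ExternalParser::parse and the max_doc_size
value, and std::string stands in for htdig's String class.  It also differs
slightly from the patch in that it asks fread for no more than the remaining
allowance, rather than trimming the last chunk after the fact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of htdig): read from `input` until EOF,
// but never keep more than `max_bytes` bytes in the returned string.
std::string read_capped(std::FILE *input, long max_bytes)
{
    assert(input != nullptr);

    std::string content;
    char        buffer[2048];
    long        remaining = max_bytes;   // bytes we are still allowed to keep

    while (remaining > 0)
    {
        // Never request more than the remaining allowance.
        std::size_t want = sizeof(buffer);
        if (static_cast<long>(want) > remaining)
            want = static_cast<std::size_t>(remaining);

        std::size_t got = std::fread(buffer, 1, want, input);
        if (got == 0)                    // EOF or read error
            break;

        content.append(buffer, got);
        remaining -= static_cast<long>(got);
    }
    return content;
}
```

With a 5000-byte input and max_bytes set to 3000, this keeps the first 3000
bytes and stops reading; with a larger limit it returns the whole input.
Anything the converter writes past the limit is simply left unread.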

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
