
From grdetil@scrc.umanitoba.ca Sun Feb 14 16:38:08 1999
Date: Fri, 12 Feb 1999 12:48:42 -0600 (CST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: htdig@htdig.org
Cc: htdig@htdig.org, MSQL_User@st.hhs.nl
Subject: [htdig] Re: external parser causes htdig core dump


According to Frank Richter:
> > Could you set up a configuration file that digs only this document, e.g.:
> > 
> > start_url:	http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc
> > 
> > and then run htdig with -vvvvvv, using this configuration, and your
> > current parse_word_doc.pl script.  I'd like more info about what's
> > happening prior to the core dump.
> 
> I did it, see attached file. You see many many binary data...
> Of course a workaround is to change the external parser to avoid such
> garbage, but htdig should be robust enough...

The log file you sent me unfortunately didn't tell me much, but I did
manage to reproduce the problem.  I realised, when I saw how big the
portrait.doc file was, that my htdig was truncating it.  I increased
max_doc_size to 2000000, and sure enough, htdig dumped core on your
document.

When I looked at your stack backtrace earlier, I was so focused on the
garbage words that got_word was getting that I failed to realise the
real problem was the value of heading, which was way out of range and
was being used, unchecked, as an array subscript.
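
To illustrate the failure mode, here is a minimal standalone sketch
of a guarded lookup.  It is not the actual Retriever code: the table
name, the function name and the table contents are assumptions made
up for illustration, and the size of 12 is only inferred from the
bound used in the patch below.

```cpp
#include <iostream>

// Hypothetical per-heading weight table; the real htdig data structure
// may look different.  12 entries matches the "hd < 12" bound in the patch.
static const int kHeadingCount = 12;
static double heading_weight[kHeadingCount] = {0};

// Stores the weight for 'heading' in 'weight' and returns true when
// 'heading' is a valid subscript.  Returns false, without touching the
// array, when the value is out of range.
bool lookup_heading_weight(int heading, double &weight)
{
    if (heading < 0 || heading >= kHeadingCount)
    {
        std::cerr << "heading out of range: " << heading << "\n";
        return false;
    }
    weight = heading_weight[heading];
    return true;
}
```

Without the range test, a garbage value from the external parser
indexes memory outside the table, which is undefined behaviour and
would explain a core dump like the one you saw.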

The problem you reported seems to be different from the one Jesse had,
which I still can't reproduce, but I hope that this patch, together with
my earlier fixes to ExternalParser.cc, will solve that problem too!

Here's the patch for your problem, Frank.  Now, instead of getting a core
dump, you'll get a whole bunch of External parser error messages.  For the
sake of defensive programming, Retriever::got_word() should probably still
be fixed to check "heading" before using it as a subscript, but I decided
to put a check in ExternalParser.cc so the error can be reported there.

--- ./htdig/ExternalParser.cc.wordbug	Tue Feb  9 18:26:08 1999
+++ ./htdig/ExternalParser.cc	Fri Feb 12 12:22:52 1999
@@ -148,6 +148,7 @@
 
     String	line;
     char	*token1, *token2, *token3;
+    int		loc, hd;
     URL		url;
     while (readLine(input, line))
     {
@@ -164,8 +165,10 @@
 		  token2 = strtok(0, "\t");
 		if (token2 != NULL)
 		  token3 = strtok(0, "\t");
-		if (token1 != NULL && token2 != NULL && token3 != NULL)
-		  retriever.got_word(token1, atoi(token2), atoi(token3));
+		if (token1 != NULL && token2 != NULL && token3 != NULL &&
+			(loc = atoi(token2)) >= 0 && loc <= 1000 &&
+			(hd = atoi(token3)) >= 0 && hd < 12)
+		  retriever.got_word(token1, loc, hd);
 		else
 		  cerr<< "External parser error in line:"<<line<<"\n";
 		break;
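
One caveat with the check above: atoi() returns 0 for a token that
isn't a number at all, so a garbage location or heading field can
still slip through as 0.  A stricter variant could be sketched as
follows.  This is not part of the patch; the helper names are
hypothetical, and the limits (0..1000 and 0..11) are simply copied
from the patch.

```cpp
#include <cerrno>
#include <cstdlib>

// Parses 'text' as a decimal integer and checks lo <= value <= hi.
// Unlike atoi(), this rejects empty strings, non-numeric text,
// trailing junk after the digits, and values that overflow a long.
bool parse_in_range(const char *text, long lo, long hi, int &out)
{
    if (text == NULL || *text == '\0')
        return false;
    char *end = NULL;
    errno = 0;
    long value = std::strtol(text, &end, 10);
    if (errno == ERANGE || end == text || *end != '\0')
        return false;
    if (value < lo || value > hi)
        return false;
    out = static_cast<int>(value);
    return true;
}

// Validates the location and heading fields of one word line.
// 'loc' and 'hd' should only be used when this returns true.
bool valid_word_fields(const char *loc_text, const char *hd_text,
                       int &loc, int &hd)
{
    return parse_in_range(loc_text, 0, 1000, loc) &&
           parse_in_range(hd_text, 0, 11, hd);
}
```

With something like this, the call site would pass token2 and token3
to valid_word_fields() and only call got_word() when it returns true.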


-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
