Date: Thu, 16 Dec 2004 16:40:58 -0600 (CST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: "ht://Dig mailing list" <htdig-general@lists.sourceforge.net>
Cc: Gilles Detillieux <grdetil@scrc.umanitoba.ca>,
     Gilbert Detillieux <gedetil@cs.umanitoba.ca>
Subject: [htdig] external_parsers bug (was Re: [htdig] pdf indexing problems)

As a followup to the recent thread between Jon, David and Steve, I just
wanted to let you all know that I discovered a bug in the external_parsers
handling of htdig (versions 3.1.6 and 3.2.0b6).

Jon Sorensen reported verbose htdig output like this:
>     Content-Type: application/pdf
>     Header line:
>     returnStatus = 0
>     Read 8192 from document
>     Read 8192 from document
>     Read 8192 from document
>     Read 8192 from document
>     Read 907 from document
>     Read a total of 361355 bytes
>     word: Read@0
>     word: 8192@4
>     word: from@9
>     word: document@13
>     word: Read@21
>     word: 8192@26
>     word: from@30
>     word: document@35
>     word: Read@43
>     word: 8192@47
>     word: from@52
>     word: document@56

I've seen that before in posts to htdig-general, but couldn't make sense
of it.

Jon also asked:
>     I posted a question recently about indexing pdfs with doc2html
>     but I can't figure out what the problem is. I believe that the config is correct
>     but there may be a problem there. When I dig a number of pdfs the files
>     are read but the words indexed are not correct:
>     word: Read@0
>     word: 8192@4
>     word: from@9
>     Does anyone know what this indicates?
>     From looking at the message archives it seems that others have had this problem
>     but there weren't any solutions posted in the messages

It appears that htdig's stdout is being fed back into the parser, which
seemed to defy all logic, until I figured out the cause on a new test
system, which was also having problems indexing PDFs.  When I ran the
external converter manually, I got the error:

/usr/local/bin/perl: bad interpreter: No such file or directory

The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.
That explained why PDF indexing didn't work (htdig couldn't "exec"
the external_parsers script), but not why htdig was eating its own output.

Then I realized what was going on:  htdig does a fork() and execv()
to call the script, and if the execv() fails the child process exits,
as it should.  But the child process exits using the exit() function,
rather than _exit(), which is a no-no in a forked child.  The problem
is that the fork() makes a duplicate of everything in the parent
process, including all the parent's stdio buffers.  When the child
calls exit(), it flushes its copy of the parent's stdout buffer, and
since the child's stdout is the pipe back to the parent, a copy of much
of the parent's verbose output goes into that pipe, which the parent
then reads and parses.  _exit() terminates the child without flushing
stdio buffers.  The fix is to change htdig/ExternalParser.cc like this:

--- htdig/ExternalParser.cc.orig	2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc	2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
 	// Call External Parser
 	execv(parsargs[0], parsargs);
 
-	exit(EXIT_FAILURE);
+	perror("execv");
+	write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+	write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+	write(STDERR_FILENO, "\n", 1);
+	_exit(EXIT_FAILURE);
     }
 
     // Parent Process

Of course, this is only a problem if the external parser/converter script
can't be exec'ed by htdig, so if all is working well, this bug won't be
an issue.
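
For anyone who wants to see the effect outside of htdig, here is a
minimal standalone sketch of the same mechanism.  It is not htdig code:
the function name bytes_leaked_from_child() and the path
/nonexistent/parser are made up for illustration, and it assumes a POSIX
system where stdout is buffered (the default).  The parent leaves some
text in its stdout buffer, forks a child whose stdout is a pipe, lets
execv() fail, and then counts what comes back through the pipe.

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of htdig.  Returns the number of bytes the
 * parent reads back from the child's pipe after a failed execv(), or -1 if
 * pipe() or fork() fails. */
long bytes_leaked_from_child(int use_underscore_exit)
{
    int fds[2];
    char buf[256];
    long total = 0;
    ssize_t n;
    pid_t pid;

    fflush(stdout);     /* start from an empty stdout buffer */

    /* No trailing newline, so the text stays in the buffer even when
     * stdout is a line-buffered terminal. */
    fputs("Read 8192 from document ", stdout);

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stdout becomes the write end of the pipe. */
        char path[] = "/nonexistent/parser";    /* assumed not to exist */
        char *args[] = { path, NULL };

        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);

        execv(args[0], args);   /* fails, so we fall through */

        if (use_underscore_exit)
            _exit(EXIT_FAILURE);    /* no stdio flush */
        exit(EXIT_FAILURE);         /* flushes the inherited buffer */
    }

    /* Parent: read whatever the child wrote until end of file. */
    close(fds[1]);
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);

    fflush(stdout);     /* the parent's own copy goes to the real stdout */
    return total;
}
```

Calling bytes_leaked_from_child(0) should return a non-zero count,
because the child's exit() pushes the inherited buffer into the pipe,
while bytes_leaked_from_child(1) should return 0, because _exit() leaves
the buffer alone.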

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
