Date: Thu, 7 Feb 2002 15:42:59 -0600 (CST)
From: Gilles Detillieux
To: "ht://Dig mailing list"
Subject: [htdig] PATCH - allow NUL characters in text/* documents

This patch fixes a problem in the 3.1.x text/html and text/plain parsers.
The parsers stop parsing as soon as they encounter an ASCII NUL (0)
character. While this problem has been around since the very early days
of htdig, it was only brought to light after the release of 3.1.6. I guess
that means there's not a lot of text documents out there that contain
nulls, thankfully. However, if this is a problem for you, and fixing the
documents isn't an easy option, you may want to apply this patch.

NOTE: This patch may not be for everyone! It will more than likely slow
down parsing of documents, particularly on slower systems with not a lot
of RAM. The reason is the parser does an extra pass through the in-memory
copy of the document to find and replace nulls - this will cause extra
paging if the whole document doesn't stay in htdig's set of resident pages.

Apply this patch in your main htdig-3.1.6 source directory using the
command:

    patch -p0 < this-message-file

--- htdig/HTML.cc.orig	Thu Jan 31 17:47:17 2002
+++ htdig/HTML.cc	Thu Feb  7 15:00:15 2002
@@ -146,6 +146,8 @@ HTML::parse(Retriever &retriever, URL &b
     if (contents == 0 || contents->length() == 0)
         return;
 
+    contents->replace('\0', ' ');
+
     base = &baseURL;
 
     //
--- htdig/Plaintext.cc.orig	Thu Jan 31 17:47:17 2002
+++ htdig/Plaintext.cc	Thu Feb  7 15:00:33 2002
@@ -40,6 +40,8 @@ Plaintext::parse(Retriever &retriever, U
     if (contents == 0 || contents->length() == 0)
         return;
 
+    contents->replace('\0', ' ');
+
     unsigned char *position = (unsigned char *) contents->get();
     unsigned char *start = position;
     static int minimumWordLength = config.Value("minimum_word_length", 3);

-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
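
For readers who want to see the effect of the one-line change outside of htdig, here is a minimal standalone sketch. It is not htdig code: it uses std::string and std::replace as stand-ins for htdig's own String class and its replace() method, and the helper names visible_length and replace_nuls are hypothetical. It also assumes that the parsers stop at a NUL because they scan the buffer the way C string functions do, which the message above does not spell out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Length of the document as seen by code that treats the first NUL byte
// as end-of-input (assumed here to mimic how the unpatched parsers stop).
std::size_t visible_length(const std::string &contents)
{
    return std::strlen(contents.c_str());
}

// Stand-in for the patch's contents->replace('\0', ' '): one extra pass
// over the in-memory document, turning every NUL byte into a space.
void replace_nuls(std::string &contents)
{
    std::replace(contents.begin(), contents.end(), '\0', ' ');
}
```

Calling replace_nuls on a buffer before scanning it means a NUL-terminated scan reaches the real end of the document. The cost is the same as described in the note above: every byte of the document is touched once more before parsing starts.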