Patch to htdig 3.0.8b2 by Reni Seindal (seindal@webadm.kb.dk)

I am currently adapting htdig for use as a search engine for the Danish Ministry of Culture, and for this I have made some changes to htdig and htsearch.

I haven't included a patch with this message, but it can be found at ftp://webadm.kb.dk/pub/htdig.patch (16kb). This patch is relative to version 3.0.8b2. It should be applied from the directory just above the source tree, i.e. the one containing the directory htdig-3.0.8b2.

The changes to htdig are a different timeout mechanism and fixes to the HTML parser. The changes to htsearch introduce new variables for templates.

I have had problems with the timeout mechanism under Linux 2.0.29 (installed from Slackware 3.2). The htdig code used alarm(2) to have the read from the socket interrupted if it took too long, but for some reason the read was restarted after the signal. I think Linux is misbehaving here, as the read should fail with EINTR. The interesting part is that this problem only arose when the web server queried was Microsoft IIS. It never happened with any other kind of web server.

I haven't had the time or the opportunity to track the problem any further, but I have "fixed" it by changing the way timeouts work in htdig. Instead of using alarm(2) I have used select(2) in the Connection class. The timeout is thus handled at the same level as the input, and I think the result is cleaner, simpler and more straightforward. Instead of trying to break out of a blocked syscall, we simply avoid performing a blocking call.

This change affects htdig/Document.cc (removal of the alarm and sigaction calls), htlib/Connection.h (adding a timeout method to set the timeout interval) and htlib/Connection.cc (adding a call to select in partial_read() if a timeout interval is set).

I agree that fixing a problem in Linux by changing htdig is not The Right Thing(tm), but I do think the resulting code is simpler and more intuitive.
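As an illustration of the select(2) approach described above, here is a minimal standalone sketch. It is not the code from the patch: read_with_timeout is a hypothetical free function rather than a Connection method, and reporting a timeout through ETIMEDOUT is an assumption made for this example only.

```cpp
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper (not part of htdig): wait until fd is readable or the
// timeout expires, and only then call read(), so read() itself never blocks.
// Returns the number of bytes read, 0 on end of file, or -1 on error.
// On timeout it returns -1 with errno set to ETIMEDOUT (an assumption of
// this sketch; the patch signals the timeout differently).
ssize_t read_with_timeout(int fd, char *buffer, size_t maxlength,
                          int timeout_seconds)
{
    // select() can only watch descriptors below FD_SETSIZE.
    assert(fd >= 0 && fd < FD_SETSIZE);

    if (timeout_seconds > 0) {
        int ready;
        do {
            // Re-initialise both arguments on every attempt, since select()
            // may modify them. Note that a retry after EINTR waits for the
            // full interval again.
            fd_set fds;
            FD_ZERO(&fds);
            FD_SET(fd, &fds);

            timeval tv;
            tv.tv_sec = timeout_seconds;
            tv.tv_usec = 0;

            ready = ::select(fd + 1, &fds, nullptr, nullptr, &tv);
        } while (ready < 0 && errno == EINTR);

        if (ready == 0) {
            errno = ETIMEDOUT;   // nothing arrived within the interval
            return -1;
        }
        if (ready < 0)
            return -1;           // select() itself failed
    }

    ssize_t count;
    do {
        count = ::read(fd, buffer, maxlength);
    } while (count < 0 && errno == EINTR);
    return count;
}
```

A timeout of 0 keeps the plain blocking read, which mirrors the idea that the default behaviour stays unchanged unless a timeout is set.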
The default behaviour of the Connection class is not modified; the new behaviour requires a call to the Connection::timeout method. I have only modified the behaviour of partial_read, but to be consistent the changes should be propagated to the other methods as well, i.e. write_partial and connect.

I also noticed that the HTML parser was seriously broken, causing all SGML entities to be treated as word constituents. For entities such as &nbsp;, &quot; etc. this is clearly not the case. Formerly the parsing of the document body was done in one loop, mapping SGML entities on the fly. I have changed this to two loops: the first removes comments, parses tags and maps SGML entities, leaving a simple text string for the second loop, which splits the text into individual words. This modification only affects htdig/HTML.cc.

In htsearch I wanted to be able to number the matches, so I have introduced a CURRENT variable for use in templates. For each match it expands to the number of the match. This required a change to the calling sequence of the method Display::displayMatch, adding the number as a second argument. The changes affect htsearch/Display.cc and htsearch/Display.h.

As I want the user to be able to select the number of matches to show per page, *and* have this value passed on to later pages, I have added a MATCH_LIST variable to go with FORMAT and METHOD. It generates a menu with the values from the new configuration parameter matches_per_page_list, marking the current value as the default. The default for this parameter is "10 25 50 100 250 500".
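To make the MATCH_LIST expansion concrete, here is a rough standalone sketch of how such a menu could be built from the configured list. It uses the standard library rather than htdig's own String and QuotedStringList classes. The function name build_match_list and the exact markup are assumptions for illustration, and matchesperpage is assumed to be the form field htsearch reads; this is not the code from the patch.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of htsearch): turn a whitespace-separated
// list such as "10 25 50 100 250 500" into an HTML menu, marking the entry
// equal to `current` as the selected one.
std::string build_match_list(const std::string &values, int current)
{
    std::istringstream in(values);
    std::ostringstream out;

    // The field name is an assumption of this sketch.
    out << "<select name=\"matchesperpage\">\n";

    int n;
    while (in >> n) {
        out << "<option value=\"" << n << "\"";
        if (n == current)
            out << " selected";
        out << ">" << n << "</option>\n";
    }

    out << "</select>\n";
    return out.str();
}
```

For example, build_match_list("10 25 50", 25) yields a menu with three options, the second of which is marked as selected, so the value chosen by the user carries over to later result pages.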
Changes affect htsearch/Display.cc and htcommon/defaults.cc.

------------------------------------------------------------------------

diff -rc htdig-3.0.8b2.orig/htcommon/defaults.cc htdig-3.0.8b2/htcommon/defaults.cc
*** htdig-3.0.8b2.orig/htcommon/defaults.cc Fri Aug 15 07:59:25 1997
--- htdig-3.0.8b2/htcommon/defaults.cc Sun Aug 17 17:24:19 1997
***************
*** 70,75 ****
--- 70,76 ----
      {"maintainer", "andrew@contigo.com"},
      {"match_method", "or"},
      {"matches_per_page", "10"},
+     {"matches_per_page_list", "10 25 50 100 250 500"},
      {"max_description_length", "60"},
      {"max_doc_size", "100000"},
      {"max_head_length", "512"},
diff -rc htdig-3.0.8b2.orig/htdig/Document.cc htdig-3.0.8b2/htdig/Document.cc
*** htdig-3.0.8b2.orig/htdig/Document.cc Fri Aug 15 23:32:19 1997
--- htdig-3.0.8b2/htdig/Document.cc Sun Aug 17 15:00:55 1997
***************
*** 53,60 ****
  typedef SIG_PF SIGNAL_HANDLER;
  #endif
  
- static Connection *current_connection;
- 
  
  //*****************************************************************************
  // Document::Document(char *u)
--- 53,58 ----
***************
*** 243,273 ****
  }
  
  
- static void
- timeout()
- {
-     if (debug > 1)
-         printf(" Timeout\n");
-     current_connection->stop_io();
- 
-     struct sigaction sa;
- #ifdef _AIX
-     sa.sa_handler = (void(*)(int)) timeout;
- #else
-     sa.sa_handler = (SIGNAL_HANDLER) timeout;
- #endif
-     sigemptyset ((sigset_t *) &sa.sa_mask);
-     sigaddset ((sigset_t *) &sa.sa_mask, SIGALRM);
- #if defined(SA_INTERRUPT)
-     sa.sa_flags = SA_INTERRUPT;
- #else
-     sa.sa_flags = 0;
- #endif
-     sigaction(SIGALRM, &sa, 0);
-     alarm(config.Value("timeout"));
- }
- 
- 
  //*****************************************************************************
  // DocStatus Document::Retrieve(time_t date)
  // Attempt to retrieve the document pointed to by our internal URL
--- 241,246 ----
***************
*** 293,299 ****
          if (c.assign_server(url->host()) == NOTOK)
              return Document_no_host;
      }
! 
      if (c.connect(1) == NOTOK)
      {
          if (debug > 1)
--- 266,272 ----
          if (c.assign_server(url->host()) == NOTOK)
              return Document_no_host;
      }
! 
      if (c.connect(1) == NOTOK)
      {
          if (debug > 1)
***************
*** 303,309 ****
          return Document_no_server;
      }
  
-     current_connection = &c;
  
      //
      // Construct and send the request to the server
--- 276,281 ----
***************
*** 373,395 ****
      //
      // Setup a timeout for the connection
      //
!     struct sigaction sa;
! #ifdef _AIX
!     sa.sa_handler = (void(*)(int)) timeout;
! #else
!     sa.sa_handler = (SIGNAL_HANDLER) timeout;
! #endif
!     sigemptyset ((sigset_t *) &sa.sa_mask);
!     sigaddset ((sigset_t *) &sa.sa_mask, SIGALRM);
! #if defined(SA_INTERRUPT)
!     sa.sa_flags = SA_INTERRUPT;
! #else
!     sa.sa_flags = 0;
! #endif
!     sigaction(SIGALRM, &sa, 0);
!     int timeout_interval = config.Value("timeout");
!     alarm(timeout_interval);
! 
      DocStatus returnStatus;
      switch (readHeader(c))
      {
--- 345,352 ----
      //
      // Setup a timeout for the connection
      //
!     c.timeout(config.Value("timeout"));
! 
      DocStatus returnStatus;
      switch (readHeader(c))
      {
***************
*** 414,420 ****
      }
      if (returnStatus != Document_ok)
      {
-         alarm(0);
          return returnStatus;
      }
  
--- 371,376 ----
***************
*** 425,432 ****
      char docBuffer[8192];
      int bytesRead;
  
-     if (debug < 2)
-         alarm(timeout_interval);
      while ((bytesRead = c.read(docBuffer, sizeof(docBuffer))) > 0)
      {
          if (debug > 2)
--- 381,386 ----
***************
*** 434,444 ****
          if (contents.length() + bytesRead > max_doc_size)
              break;
          contents.append(docBuffer, bytesRead);
-         if (debug < 2)
-             alarm(timeout_interval);
      }
      c.close();
-     alarm(0);
      document_length = contents.length();
  
      if (debug > 2)
--- 388,395 ----
diff -rc htdig-3.0.8b2.orig/htdig/HTML.cc htdig-3.0.8b2/htdig/HTML.cc
*** htdig-3.0.8b2.orig/htdig/HTML.cc Fri Aug 15 07:59:26 1997
--- htdig-3.0.8b2/htdig/HTML.cc Sun Aug 17 16:34:35 1997
***************
*** 99,104 ****
--- 99,107 ----
      int in_space;
      unsigned char *q, *start;
      unsigned char *position = (unsigned char *) contents->get();
+     unsigned char *text =
+         (unsigned char *) new char[strlen((char *)position)+1];
+     unsigned char *ptext = text;
  
      start = position;
      title = 0;
***************
*** 136,182 ****
              tag = 0;
              tag.append((char*)position, q - position + 1);
              do_tag(retriever, tag);
!             position = q;
          }
- #if 0
          else if (*position == '&')
          {
!             //
!             // HTML uses "&;" as a way to put special characters in a
!             // document. We'll just skip these...
!             //
!             unsigned char *orig = position++;
!             if (!isalnum(*position))
!             {
!                 //
!                 // This is an illegal escape. We need to assume the
!                 // author just wants to use a '&'.
!                 //
!                 position = orig;
!             }
!             else
!             {
!                 while (isalnum(*position))
!                     position++;
!                 if (*position != ';')
!                 {
!                     //
!                     // Broken escape. Didn't end with a ';'. Assume literal
!                     //
!                     position = orig;
!                 }
!                 else
!                 {
!                     position++;
!                     continue;
!                 }
!             }
          }
! #endif
! 
          word = 0;
!         if (*position > 0 && (isalnum(*position) || *position >= 160 ||
!                               *position == '&'))
          {
              //
              // Start of a word. Try to find the whole thing
--- 139,162 ----
              tag = 0;
              tag.append((char*)position, q - position + 1);
              do_tag(retriever, tag);
!             position = q+1;
          }
          else if (*position == '&')
          {
!             *ptext++ = SGMLEntities::translateAndUpdate(position);
          }
!         else
!         {
!             *ptext++ = *position++;
!         }
!     }
!     *ptext++ = '\0';
! 
!     position = text;
!     while (*position)
!     {
          word = 0;
!         if (*position > 0 && (isalnum(*position)))
          {
              //
              // Start of a word. Try to find the whole thing
***************
*** 184,204 ****
              in_space = 0;
              while (*position &&
                     (isalnum(*position) ||
!                     strchr(valid_punctuation, *position) ||
!                     *position >= 160 ||
!                     *position == '&'))
              {
!                 if (*position == '&')
!                 {
!                     unsigned char ch;
!                     ch = SGMLEntities::translateAndUpdate(position);
!                     word << (char) ch;
!                 }
!                 else
!                 {
!                     word << (char)*position;
!                     position++;
!                 }
              }
  
              if (in_title && doindex)
--- 164,173 ----
              in_space = 0;
              while (*position &&
                     (isalnum(*position) ||
!                     strchr(valid_punctuation, *position)))
              {
!                 word << (char)*position;
!                 position++;
              }
  
              if (in_title && doindex)
***************
*** 251,257 ****
          //
          // Characters that are not part of a word
          //
!         if (*position != '>' && doindex)
          {
              if (isspace(*position))
              {
--- 220,226 ----
          //
          // Characters that are not part of a word
          //
!         if (doindex)
          {
              if (isspace(*position))
              {
***************
*** 299,304 ****
--- 268,275 ----
          }
      }
      retriever.got_head(head);
+ 
+     delete text;
  }
  
  
diff -rc htdig-3.0.8b2.orig/htdoc/attrs.html htdig-3.0.8b2/htdoc/attrs.html
*** htdig-3.0.8b2.orig/htdoc/attrs.html Fri Aug 15 07:59:30 1997
--- htdig-3.0.8b2/htdoc/attrs.html Sun Aug 17 18:27:33 1997
***************
*** 994,999 ****
--- 994,1027 ----
+
matches_per_page_list +
+
+
+
type:
+
+ integer list +
+
used by:
+
+ htsearch +
+
default:
+
+ 10 25 50 100 250 500 +
+
description:
+
+ The values to use in the menu of matches per page made in the + MATCH_LIST. +
+
example:
+
+ matches_per_page_list: 10 20 50 +
+
+
+

max_description_length
diff -rc htdig-3.0.8b2.orig/htdoc/hts_templates.html htdig-3.0.8b2/htdoc/hts_templates.html
*** htdig-3.0.8b2.orig/htdoc/hts_templates.html Fri Aug 15 07:59:31 1997
--- htdig-3.0.8b2/htdoc/hts_templates.html Sun Aug 17 18:22:52 1997
***************
*** 9,15 ****

ht://Dig © 1995, 1996, 1997 Andrew Scherpbier andrew@contigo.com
Please ! see the file COPYING for license information.


--- 9,15 ----

ht://Dig © 1995, 1996, 1997 Andrew Scherpbier andrew@contigo.com
Please ! see the file COPYING for license information.


***************
*** 65,72 ****

CGI !
This expands to whatever the SCRIP_NAME environment variable is.
DESCRIPTIONS
A list of descriptions for the matched document. The entries in the list are separated by <br>.
--- 65,74 ----

CGI !
This expands to whatever the SCRIPT_NAME environment variable is. +
CURRENT +
The number of the current match
DESCRIPTIONS
A list of descriptions for the matched document. The entries in the list are separated by <br>.
***************
*** 86,91 ****
--- 88,97 ----
LOGICAL_WORDS
A string of the search words with either "and" or "or" between the words, depending on the type of search. +
MATCH_LIST +
Expands to an HTML menu of all the configured number of + matches per page. + The current number will be the default one.
MATCH_MESSAGE
This is either all or some depending on the match method used.
diff -rc htdig-3.0.8b2.orig/htlib/Connection.cc htdig-3.0.8b2/htlib/Connection.cc
*** htdig-3.0.8b2.orig/htlib/Connection.cc Fri Aug 15 07:59:35 1997
--- htdig-3.0.8b2/htlib/Connection.cc Sun Aug 17 15:22:37 1997
***************
*** 40,45 ****
--- 40,46 ----
  #include
  #include
  #include
+ #include
  
  extern "C" {
      int rresvport(int *);
***************
*** 54,59 ****
--- 55,61 ----
      peer = 0;
      server_name = 0;
      all_connections.Add(this);
+     timeout_value = 0;
  }
  
  
***************
*** 80,85 ****
--- 82,88 ----
      peer = 0;
      server_name = 0;
      all_connections.Add(this);
+     timeout_value = 0;
  }
  
  
***************
*** 139,144 ****
--- 142,158 ----
  
  
  //*****************************************************************************
+ // int Connection::timeout(int value)
+ //
+ int Connection::timeout(int value)
+ {
+     int oval = timeout_value;
+     timeout_value = value;
+     return oval;
+ }
+ 
+ 
+ //*****************************************************************************
  // int Connection::close()
  //
  int Connection::close()
***************
*** 380,388 ****
  {
      int count;
  
      do
      {
!         count = ::read(sock, buffer, maxlength);
      }
      while (count < 0 && errno == EINTR && !need_io_stop);
      need_io_stop = 0;
--- 394,422 ----
  {
      int count;
  
+     need_io_stop = 0;
      do
      {
!         errno = 0;
! 
!         if (timeout_value > 0) {
!             fd_set fds;
!             FD_ZERO(&fds);
!             FD_SET(sock, &fds);
! 
!             timeval tv;
!             tv.tv_sec = timeout_value;
!             tv.tv_usec = 0;
! 
!             int selected = ::select(sock+1, &fds, 0, 0, &tv);
!             if (selected <= 0)
!                 need_io_stop++;
!         }
! 
!         if (!need_io_stop)
!             count = ::read(sock, buffer, maxlength);
!         else
!             count = -1; // Input timed out
      }
      while (count < 0 && errno == EINTR && !need_io_stop);
      need_io_stop = 0;
diff -rc htdig-3.0.8b2.orig/htlib/Connection.h htdig-3.0.8b2/htlib/Connection.h
*** htdig-3.0.8b2.orig/htlib/Connection.h Fri Aug 15 07:59:35 1997
--- htdig-3.0.8b2/htlib/Connection.h Sun Aug 17 14:57:14 1997
***************
*** 42,47 ****
--- 42,48 ----
      int close();
      int ndelay();
      int nondelay();
+     int timeout(int value);
  
      // Port stuff
      int assign_port(int port = 0);
***************
*** 85,90 ****
--- 86,92 ----
      char *peer;
      char *server_name;
      int need_io_stop;
+     int timeout_value;
  };
  
  
diff -rc htdig-3.0.8b2.orig/htsearch/Display.cc htdig-3.0.8b2/htsearch/Display.cc
*** htdig-3.0.8b2.orig/htsearch/Display.cc Fri Aug 15 07:59:44 1997
--- htdig-3.0.8b2/htsearch/Display.cc Sun Aug 17 17:55:08 1997
***************
*** 114,120 ****
                  continue; // The document isn't present for some reason
              ref->DocAnchor(match->getAnchor());
              ref->DocScore(match->getScore());
!             displayMatch(match);
              numberDisplayed++;
          }
          currentMatch++;
--- 114,120 ----
                  continue; // The document isn't present for some reason
              ref->DocAnchor(match->getAnchor());
              ref->DocScore(match->getScore());
!             displayMatch(match, currentMatch+1);
              numberDisplayed++;
          }
          currentMatch++;
***************
*** 147,153 ****
  
  //*****************************************************************************
  void
! Display::displayMatch(ResultMatch *match)
  {
      String *str;
  
--- 147,153 ----
  
  //*****************************************************************************
  void
! Display::displayMatch(ResultMatch *match, int current)
  {
      String *str;
  
***************
*** 159,167 ****
      }
      vars.Add("URL", new String(match->getURL()));
      vars.Add("SCORE", new String(form("%d", match->getScore())));
      char *title = ref->DocTitle();
      if (!title || !*title)
!         title = "[No title]";
      vars.Add("TITLE", new String(title));
      vars.Add("STARSRIGHT", generateStars(ref, 1));
      vars.Add("STARSLEFT", generateStars(ref, 0));
--- 159,169 ----
      }
      vars.Add("URL", new String(match->getURL()));
      vars.Add("SCORE", new String(form("%d", match->getScore())));
+     vars.Add("CURRENT", new String(form("%d", current)));
+ 
      char *title = ref->DocTitle();
      if (!title || !*title)
!         title = match->getURL();
      vars.Add("TITLE", new String(title));
      vars.Add("STARSRIGHT", generateStars(ref, 1));
      vars.Add("STARSLEFT", generateStars(ref, 0));
***************
*** 282,288 ****
      }
      *str << "\n";
      vars.Add("METHOD", str);
! 
      //
      // If a paged output is required, set the appropriate variables
      //
--- 284,303 ----
      }
      *str << "\n";
      vars.Add("METHOD", str);
! 
!     str = new String();
!     QuotedStringList mppl(config["matches_per_page_list"], " \t\r\n");
!     *str << "\n";
!     vars.Add("MATCH_LIST", str);
! 
      //
      // If a paged output is required, set the appropriate variables
      //
diff -rc htdig-3.0.8b2.orig/htsearch/Display.h htdig-3.0.8b2/htsearch/Display.h
*** htdig-3.0.8b2.orig/htsearch/Display.h Fri Aug 15 07:59:44 1997
--- htdig-3.0.8b2/htsearch/Display.h Sun Aug 17 17:51:23 1997
***************
*** 46,52 ****
      void setCGI(cgi *);
  
      void display(int pageNumber);
!     void displayMatch(ResultMatch *);
      void displayHeader();
      void displayFooter();
      void displayNomatch();
--- 46,52 ----
      void setCGI(cgi *);
  
      void display(int pageNumber);
!     void displayMatch(ResultMatch *, int);
      void displayHeader();
      void displayFooter();
      void displayNomatch();