
From grdetil@scrc.umanitoba.ca Wed Dec  8 10:37:04 1999
Date: Wed, 8 Dec 1999 12:02:35 -0600 (CST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: "Joe R. Jah" <jjah@cloud.ccsf.cc.ca.us>
Cc: grdetil@scrc.umanitoba.ca, ghutchis@wso.williams.edu,
    htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] Re: htdig-3.1.4 prerelease

According to Joe R. Jah:
> On Wed, 8 Dec 1999, Gilles Detillieux wrote:
> > Is it possible that you were getting the extra .shtml/ stuff, but just
> > weren't detecting it in your searches, or are you sure they never came up?
> 
> For that particular keyword search I am sure they never came up.

No, what I meant was: are you sure they never came up at all while
running htdig 3.1.3?  If your only test of which documents were indexed
is a few htsearch commands, then it's not an exhaustive test of what's
been indexed.  The implication I was responding to was that 3.1.3 didn't
index these .shtml/ documents, but given what you've told me so far,
I suspect that's not the case.

I do find it interesting that the .shtml/ problem on your site didn't
lead to an infinite hierarchy of bad URLs, as a few other users had
reported previously when running into this SSI problem.

> > Where does the word appear in these 19 extra documents?  If it's in img
> > alt text, or immediately after a bare ampersand (&), that would explain
>   ^^^^^^^^
> > why htdig 3.1.3 or earlier failed to index that word in these documents.
> > If it appears elsewhere, I'd be very curious to know why htdig 3.1.3
> > missed it, and if it doesn't appear anywhere in the document or in
> > descriptions of hyperlinks to the documents, I'd like to know why htdig
> > 3.1.4 is putting it in the index.  Please look into this further, if you
> > can, and get back to me ASAP.  We'd like to release 3.1.4 tomorrow, but
> > not if it's putting incorrect entries in the index.
> 
> Bingo;)  They were all in img alt text, in that particular search.

Phew!  Not that I was that worried.

> > Wow, patches to 3.1.4 before it's even released!  :)
> 
> Yes, and I'd love to add another one to it, the max_keywords attribute I
> requested a month or so ago;)

Well, now you're just getting demanding, aren't you?  ;-)  I did give it
a vote for 3.2.0b1 back on October 12, but said I'm not volunteering for
the job.  OK, so now I am...  (Maybe someone else can adapt and document
it for 3.2.0b1?)  And I'm sure Joe will volunteer to test it.  :-)

This undocumented and untested patch adds the max_keywords attribute to
htdig.  It limits the number of meta tag keywords indexed per document
to the attribute value.  A value of 0 (the default) means no limit.
This helps combat meta keyword spamming, but the first n spam keywords
in a document still get indexed, so searches for those words will still
pull up the spamming documents.
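
As a rough illustration of the counting logic only (not the patch itself),
here is a minimal standalone sketch.  The function limit_keywords and its
parameter names are hypothetical and do not exist in htdig; the sketch
assumes the keywords have already been split into words, whereas the real
code pulls them one at a time from HtWordToken() and hands them to
retriever.got_word().

```cpp
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of htdig: returns the meta keywords that
// would be passed on to the indexer under a per-document max_keywords cap.
std::vector<std::string> limit_keywords(const std::vector<std::string> &words,
                                        int max_keywords,
                                        std::size_t minimum_word_length)
{
    // As in the patch, 0 means "no limit"; map it to the largest int.
    if (max_keywords == 0)
        max_keywords = INT_MAX;

    std::vector<std::string> kept;
    int keywords_count = 0;   // starts at zero for each document
    for (const std::string &w : words)
    {
        // Words that are too short are rejected before the counter is
        // incremented, so they do not use up the quota (same
        // short-circuit order as the condition in the patch).
        if (w.length() >= minimum_word_length
                && ++keywords_count <= max_keywords)
            kept.push_back(w);
    }
    return kept;
}
```

With a limit of 2 and a minimum word length of 3, the words "a", "spam",
"eggs", "ham" would yield only "spam" and "eggs".  In a configuration
file the attribute would presumably be set with a line such as
"max_keywords: 10", following htdig's usual "attribute: value" syntax.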

--- htcommon/defaults.cc.orig	Mon Dec  6 16:14:04 1999
+++ htcommon/defaults.cc	Wed Dec  8 11:36:27 1999
@@ -88,6 +88,7 @@ ConfigDefaults	defaults[] =
     {"max_doc_size",			"100000"},
     {"max_head_length",			"512"},
     {"max_hop_count",			"999999"},
+    {"max_keywords",			"0"},
     {"max_meta_description_length",     "512"},
     {"max_prefix_matches",		"1000"},
     {"max_stars",			"4"},
--- htdig/HTML.cc.orig	Fri Dec  3 11:03:04 1999
+++ htdig/HTML.cc	Wed Dec  8 11:44:54 1999
@@ -27,6 +27,8 @@ static StringMatch	attrs;
 static StringMatch	srcMatch;
 static StringMatch	hrefMatch;
 static StringMatch	keywordsMatch;
+static int		keywordsCount;
+static int		max_keywords;
 static int		offset;
 static int		totlength;
 
@@ -98,6 +100,9 @@ HTML::HTML()
     keywordsMatch.IgnoreCase();
     keywordsMatch.Pattern(keywordNames.Join('|'));
     keywordNames.Release();
+    max_keywords = config.Value("max_keywords", 0);
+    if (max_keywords == 0)
+	max_keywords = (int) ((unsigned int) ~1 >> 1);
     
     word = 0;
     href = 0;
@@ -150,6 +155,7 @@ HTML::parse(Retriever &retriever, URL &b
     static char         *skip_start = config["noindex_start"];
     static char         *skip_end = config["noindex_end"];
 
+    keywordsCount = 0;
     offset = 0;
     title = 0;
     head = 0;
@@ -792,7 +798,8 @@ HTML::do_tag(Retriever &retriever, Strin
 		char	*w = HtWordToken(transSGML(keywords));
 		while (w && doindex)
 		{
-		    if (strlen(w) >= minimumWordLength)
+		    if (strlen(w) >= minimumWordLength
+				&& ++keywordsCount <= max_keywords)
 		      retriever.got_word(w, 1, 10);
 		    w = HtWordToken(0);
 		}
@@ -875,7 +882,8 @@ HTML::do_tag(Retriever &retriever, Strin
 		    char	*w = HtWordToken(transSGML(conf["content"]));
 		    while (w && doindex)
 		    {
-			if (strlen(w) >= minimumWordLength)
+			if (strlen(w) >= minimumWordLength
+				&& ++keywordsCount <= max_keywords)
 			  retriever.got_word(w, 1, 10);
 			w = HtWordToken(0);
 		    }

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
