
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: htdig@htdig.org
Subject: [htdig] Patch

This patch improves suffix handling (PR#560) to prevent inappropriate
suffix stripping in endings fuzzy matches.

> From: Steve Arlow <yorick@ClarkHill.com>
> Subject: Suffix-handling improvement
> To: htdig3-bugs@htdig.org
> Date: Tue, 8 Jun 1999 19:57:54 -0400 (EDT)
> Cc: yorick@yorick.com
> 
> Hello,
> 
> I do consulting for a number of law firms, and quickly discovered a
> problem with htfuzzy matching on the word "witness".  (There are
> three root words in the distribution dictionary that end in "-ness"
> and certainly exhibit this problem; the other two are "highness"
> and "likeness".  Other words are arguable.)
> 
> The fix (which does not appear to break anything else AFAICT, but
> may have a small effect on performance) is to add a preliminary check
> on root2word before trying word2root.  The code is below (from the
> file htdig-3.1.2/htfuzzy/Endings.cc); optimize it to your taste.

Follow-up example:
> Words of the form XXXness which are not a form of the word XXX.  If I
> enter "witness" into htdig with matching for alternate endings enabled,
> it will look for "wit", "wits", or "witness".  What it should really be
> looking for is "witness", "witnessed", "witnessing", or "witnesses".
> 
> A similar problem might occur with other suffixes, but I can't think of
> an example off the top of my head.
> 
> The fix is to try to interpret each term as a root word before trying
> to interpret it as an alternate form.
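
The lookup order described above can be sketched in standalone C++ as
follows.  This is only an illustration, not the htdig code: the Db
typedef, addForms() and expandEndings() are hypothetical names, std::map
stands in for the real word2root and root2word databases, and the
lowercasing and case-insensitive comparison done in Endings.cc are left
out for brevity.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the endings databases (not the htdig types).
using Db = std::map<std::string, std::string>;

// Append every space-separated form in `forms` except `skip`.
// Assumption: exact comparison here; Endings.cc compares case-insensitively.
static void addForms(const std::string &forms, const std::string &skip,
                     std::vector<std::string> &out)
{
    std::istringstream in(forms);
    std::string token;
    while (in >> token)
        if (token != skip)
            out.push_back(token);
}

// Patched order: try the word as a root first, and consult word2root
// only when the word is not itself a known root.
std::vector<std::string> expandEndings(const Db &root2word,
                                       const Db &word2root,
                                       const std::string &w)
{
    std::vector<std::string> words;
    std::string word = w;

    auto asRoot = root2word.find(word);
    if (asRoot != root2word.end()) {
        // The word is a root: use its own forms, no suffix stripping.
        addForms(asRoot->second, w, words);
        return words;
    }

    auto toRoot = word2root.find(word);
    if (toRoot != word2root.end()) {
        // The word is a derived form: add its root, then expand the root.
        word = toRoot->second;
        words.push_back(word);
    }

    auto forms = root2word.find(word);
    if (forms != root2word.end())
        addForms(forms->second, w, words);
    return words;
}
```

With sample tables in which "witness" is both a root of its own and
listed as a form of "wit", a lookup of "witness" in this sketch yields
only the forms of "witness", while a lookup of "wits" still falls back
to the root "wit" and its other forms.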

--- htdig-3.1.2/htfuzzy/Endings.cc.endingsbug	Wed Apr 21 21:47:58 1999
+++ htdig-3.1.2/htfuzzy/Endings.cc	Fri Jul 30 14:43:57 1999
@@ -68,22 +68,6 @@ Endings::getWords(char *w, List &words)
     String	word = w;
     word.lowercase();
 	
-    if (word2root->Get(word, data) == OK)
-    {
-	//
-	// Found the root of the word.  We'll add it to the list already
-	//
-	word = data;
-	words.Add(new String(word));
-    }
-    else
-    {
-	//
-	// The root wasn't found.  This could mean that the word
-	// is already the root.
-	//
-    }
-
     if (root2word->Get(word, data) == OK)
     {
 	//
@@ -97,6 +81,40 @@ Endings::getWords(char *w, List &words)
 		words.Add(new String(token));
 	    }
 	    token = strtok(0, " ");
+	}
+    }
+    else
+    {
+	if (word2root->Get(word, data) == OK)
+	{
+	    //
+	    // Found the root of the word.  We'll add it to the list already
+	    //
+	    word = data;
+	    words.Add(new String(word));
+	}
+	else
+	{
+	    //
+	    // The root wasn't found.  This could mean that the word
+	    // is already the root.
+	    //
+	}
+
+	if (root2word->Get(word, data) == OK)
+	{
+	    //
+	    // Found the root's permutations
+	    //
+	    char	*token = strtok(data.get(), " ");
+	    while (token)
+	    {
+		if (mystrcasecmp(token, w) != 0)
+		{
+		    words.Add(new String(token));
+		}
+		token = strtok(0, " ");
+	    }
 	}
     }
 }

