
From grdetil@scrc.umanitoba.ca Tue Dec  7 14:18:30 1999
Date: Tue, 7 Dec 1999 16:00:18 -0600 (CST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: jjah@cloud.ccsf.cc.ca.us
Cc: ghutchis@wso.williams.edu, htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] Re: htdig-3.1.4 prerelease

According to Joe R. Jah:
> I downloaded and installed it on a BSDI 4.0 box; it compiled but, htsearch
> dumped core.  I followed the old BSDI/htdig fix:
...
> everything worked except my the old local duplicate suppressor patch:
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0
> did not quite do its job.
...
> As you see database sizes do not vary too much, but the results pages
> point to the same URL MULTIPLE times in 3.1.4 case; baffling;-/?

I tried to apply this patch to the 3.1.4 prerelease just now, and it failed
entirely.  Did you apply it manually?  Did you change the IsLocal() call in
the patch to GetLocal(), as the new Retriever code requires?  Did you run
htdig with -v to see whether the patched-in duplicate suppression code was
actually being activated?

The different database sizes could be because 3.1.4 indexes img alt text
and doesn't clobber words immediately following bare ampersands.
I can't imagine why you'd see the exact same URL multiple times, but it
may be that in manually applying the patch to Need2Get, you broke the
function.

Here's a 3.1.4 adaptation of this old patch, completely untested of course,
but if you want to give it a shot, please do.  If the old code worked, I
can see no reason why this patch wouldn't.

[Adapted from patch by Warren Jones]
This patch to ht://Dig allows it to reject URLs on a local host that
are links (through the file system) to a URL that has already been
indexed.  This works with the local_urls option in version 3.1.4.
I didn't bother to create another hashtable, but just added a key
based on the file's device and inode numbers to Retriever::visited.
The following patch was made against version 3.1.4.
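
For anyone who wants to see the idea in isolation before reading the diff,
here is a minimal standalone sketch of duplicate detection by device and
inode.  It is not ht://Dig code: LinkTracker and firstSeenAs() are names
made up for illustration, and it uses a std::map where the patch reuses
Retriever::visited.  It assumes a POSIX system with stat().

```cpp
#include <sys/types.h>
#include <sys/stat.h>
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper, not part of ht://Dig: remembers every file it has
// seen, keyed by (device, inode), which is identical for all links to a file.
class LinkTracker
{
public:
    // Returns the path first recorded for this file if `path` is another
    // name for it, or an empty string if the file is new or stat() fails.
    std::string firstSeenAs(const char *path)
    {
        assert(path != 0);

        struct stat buf;
        if (stat(path, &buf) != 0)
            return std::string();       // can't stat it, so treat it as new

        std::pair<dev_t, ino_t> key(buf.st_dev, buf.st_ino);
        std::map<std::pair<dev_t, ino_t>, std::string>::iterator it =
            seen.find(key);
        if (it != seen.end())
            return it->second;          // same file reached by another name

        seen[key] = path;               // first visit: remember this name
        return std::string();
    }

private:
    std::map<std::pair<dev_t, ino_t>, std::string> seen;
};
```

Keying on the (device, inode) pair directly sidesteps formatting the two
numbers into a string, at the cost of needing a separate table; the patch
formats them into a string key so it can share the existing hashtable.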

--- htdig/Retriever.cc.orig	Fri Dec  3 11:58:38 1999
+++ htdig/Retriever.cc	Tue Dec  7 15:41:42 1999
@@ -18,6 +18,8 @@
 #include <signal.h>
 #include <assert.h>
 #include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
 #include "HtWordType.h"
 
 static WordList	words;
@@ -603,7 +605,37 @@ Retriever::Need2Get(char *u)
     static String	url;
     url = u;
 
-    return !visited.Exists(url);
+    if ( visited.Exists(url) )
+    	return FALSE;
+    	
+    String *local_filename = GetLocal(u);   // For local URLs, check
+    if ( local_filename )		    // list for device and inode
+    {					    // to make sure we haven't
+	struct stat buf;		    // already indexed a link
+					    // to this file.
+
+	if ( stat(local_filename->get(),&buf) == 0 )
+	{
+	    char key[4*sizeof(unsigned long)+2];	      // Make hash key
+	    sprintf( key, "%lx+%lx", (unsigned long)buf.st_dev, (unsigned long)buf.st_ino );  // from device
+	    if ( visited.Exists(key) )			      // and inode.
+	    {
+		if ( debug ) {
+		    String *dup = (String*)visited.Find(key);
+		    cout << endl
+			 << "Duplicate: " << local_filename->get()
+			 << " -> "        << dup->get() << endl;
+		}
+		delete local_filename;
+		return FALSE;
+	    }
+	    visited.Add(key,local_filename);
+	    return TRUE;
+	}
+	delete local_filename;
+    }
+    return TRUE;
+
 }
 
 


-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
