
From wjones@tc.fluke.com Tue Jan 11 22:26:18 2000
Date: Tue, 11 Jan 2000 16:04:33 -0800
From: Warren Jones <wjones@tc.fluke.com>
To: htdig3-dev@htdig.org
Subject: [htdig3-dev] Fixes for valid_extensions

I was very happy to find that the "valid_extensions" option has
been added in version 3.1.4 -- something like this is essential
given the rather chaotic nature of the web server that I have
to index.  But I found that a couple of changes were necessary
to make valid_extensions work the way I wanted it to.

If "valid_extensions" is defined, I'd like to retrieve URLs
without extensions *if_and_only_if* they represent a directory.
However, I found that all URLs without extensions are rejected
if the URL contains a fully qualified domain name, e.g.:

     http://www.foo.com/bar/

Retriever::IsValidURL() rejects this URL because it thinks
the extension is:

     .com/bar/

The patch for Retriever.cc (included below) fixes this.
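
As a rough illustration of the check outside of ht://Dig, here is a
minimal standalone sketch.  The helper name finalExtension() is made up
for this example and is not part of the ht://Dig source; it only mirrors
the strrchr/strchr test used in the patch.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper (not part of ht://Dig): returns the extension of
// the final path component, including the leading dot, or an empty
// string if the last dot in the URL is followed by a slash.
std::string finalExtension(const char *url)
{
    const char *ext = std::strrchr(url, '.');  // last dot anywhere in the URL
    if (ext && std::strchr(ext, '/'))          // a slash after that dot means the
        ext = nullptr;                         // dot is not in the final component
    return ext ? std::string(ext) : std::string();
}
```

With this check, "http://www.foo.com/bar/" yields no extension, while
"http://www.foo.com/bar/index.html" yields ".html".  Note that a bare
host with no path at all (e.g. "http://www.foo.com") would still yield
".com" in this sketch.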

To ensure that a URL without an extension will be retrieved
only if it's a directory, I modified URL::normalize() so that
a slash is appended to any URL that doesn't have an extension.
This guarantees that retrieval will fail if the URL is not
a directory.  This works for me, but I'm not sure that it's
the best solution -- comments would be appreciated.
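
To make the intended behavior easier to comment on, here is a small
standalone sketch of the same rule.  It uses std::string rather than
ht://Dig's String class, and the function name
appendSlashIfNoExtension() is hypothetical; it is meant only to show
which paths get a trailing slash, not to replace the patch.

```cpp
#include <string>

// Hypothetical stand-in for the step added to URL::normalize():
// append "/" to any path whose final component has no dot in it.
std::string appendSlashIfNoExtension(std::string path)
{
    const std::string::size_type npos = std::string::npos;
    std::string::size_type slash = path.rfind('/');

    // Path already ends in a slash: leave it alone.
    if (slash != npos && slash + 1 == path.size())
        return path;

    // A dot counts as an extension only if it comes after the last slash.
    std::string::size_type dot = path.rfind('.');
    bool hasExtension = (dot != npos) && (slash == npos || dot > slash);

    if (!hasExtension)
        path += '/';
    return path;
}
```

Under this rule "/bar" becomes "/bar/", "/bar/index.html" is left
unchanged, and "/a.b/c" becomes "/a.b/c/" because the dot is not in the
final component.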

-- 
Warren Jones
Fluke Corporation

---------------------------- snip snip ----------------------------

Index: Retriever.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 Retriever.cc
*** Retriever.cc	1999/12/15 22:06:09	1.1.1.5
--- Retriever.cc	2000/01/11 00:28:29
***************
*** 702,707 ****
--- 702,709 ----
      //
      char	*ext = strrchr(url, '.');
      String	lowerext;
+     if ( ext && strchr(ext,'/') )	// Ignore a dot if it's not in the
+     	ext = NULL;			// final component of the path.
      if (ext)
        {
  	lowerext = ext;

Index: URL.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htlib/URL.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 URL.cc
*** URL.cc	1999/12/15 22:06:35	1.1.1.5
--- URL.cc	2000/01/11 23:09:26
***************
*** 469,474 ****
--- 469,490 ----
  
      removeIndex(_path);
  
+     if ( *config["valid_extensions"] != '\0' )
+     { 
+ 	// If we're only accepting valid extensions, then append
+ 	// a trailing slash to any URL without an extension.
+ 	// This ensures that the only URLs without extensions
+ 	// we retrieve will be directories.
+ 
+ 	char *slash = strrchr( _path, '/' );
+ 	if ( ! slash || slash[1] != '\0' )
+ 	{
+ 	    char *dot = strrchr( _path, '.' );
+ 	    if ( dot <= slash )
+ 		_path << "/";
+         }
+     }
+ 
      //
      // Convert a hostname to an IP address
      //



