
From grdetil@scrc.umanitoba.ca Tue Feb  1 08:50:29 2000
Date: Tue, 1 Feb 2000 09:19:30 -0600 (CST)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: fxbois@cybercable.fr
Cc: htdig@htdig.org
Subject: [htdig] [PATCH] fix valid_extensions handling bugs

According to fx:
> it's very strange ...
> I show you my conf
> -------------------------------------------------------------
> database_dir:  /home/web/inerd/htdig/db
> database_base:  ${database_dir}/inerd
> #allow_virtual_hosts: true
> valid_extensions: .html .htm .shtml .php .php3 .asp .php
> start_url:  http://192.168.0.2
> limit_urls_to:  http://192.168.0.2
> exclude_urls:  /cgi-bin/ .cgi
> bad_extensions:  .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif\
>    .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
> maintainer:  inerd
> max_head_length: 10000
> max_doc_size:  200000
> no_excerpt_show_top: false
> search_algorithm: exact:1 synonyms:0.5 endings:0.1
> search_results_wrapper: /home/web/inerd/www/htdig/wrapper_inerd.html
> nothing_found_file:     /home/web/inerd/www/htdig/nomatch_inerd.html
> ----------------------------------------------------
> the result of the htdig -i -vvv
> 
> ...
>    pushing http://192.168.0.2/index.php3
> +A tag: pos = 2, position = =/news/index.php3?idnews=3 class=news>
> href: http://192.168.0.2/news/index.php3?idnews=3 (La troisième)
> 
>    Rejected: Extension is not valid!

This error, just like the one below, indicates that the URL was rejected
because it doesn't match any of the patterns in valid_extensions.
Unfortunately, the extension test doesn't take CGI parameters into
account: the "?idnews=3" part is treated as part of the extension, so
the match against ".php3" fails.  I think this is a bug, which the patch
below should fix.

> ...
> 
> ...
> *A tag: pos = 2, position = ="/services" class="navig1">
> href: http://192.168.0.2/services (services)
> 
>    Rejected: Extension is not valid!

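In this case, the URL is rejected because of a bug in the new
valid_extensions attribute handling, as was pointed out by Warren
Jones about a month ago.  The last dot in this URL is in the host
address, not in the final component of the path, so the code wrongly
takes ".2/services" as the extension.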
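
To make the two rejections concrete, here is a minimal standalone sketch
of what the unpatched lookup appears to do with these two URLs.  It is
not the ht://Dig code itself: naive_extension() is a hypothetical helper,
and it assumes the extension is taken from the last dot of the whole URL
string, as the strrchr() call in Retriever::IsValidURL() suggests.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper (not part of ht://Dig): mimic the unpatched lookup
// by returning everything from the last '.' in the whole URL string.
// Returns an empty string if the URL contains no '.' at all.
std::string naive_extension(const char *url)
{
    const char *ext = std::strrchr(url, '.');
    return ext ? std::string(ext) : std::string();
}
```

With this helper, "http://192.168.0.2/news/index.php3?idnews=3" yields
".php3?idnews=3" and "http://192.168.0.2/services" yields ".2/services".
Neither string is listed in valid_extensions, which would explain both
"Extension is not valid!" messages.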

> ...
> 
> do you have any suggestion ?
> (I've really tried a lot of things ... a real mystery)
> 
> thanx
> 
> ps : I use 3.1.4
> and my directory index is good :
> DirectoryIndex index.html index.htm index.shtml index.cgi index.php3

Here is a patch which I hope will fix both problems.  Please let me know
if it works.

--- htdig/Retriever.cc.valextbug	Thu Dec  9 18:28:44 1999
+++ htdig/Retriever.cc	Tue Feb  1 09:16:04 2000
@@ -702,9 +702,14 @@ Retriever::IsValidURL(char *u)
     //
     char	*ext = strrchr(url, '.');
     String	lowerext;
+    if (ext && strchr(ext, '/'))	// Ignore a dot if it's not in the
+      ext = NULL;			// final component of the path.
     if (ext)
       {
 	lowerext = ext;
+	int parm = lowerext.indexOf('?');	// chop off URL parameter
+	if (parm >= 0)
+	    lowerext.chop(lowerext.length() - parm);
 	lowerext.lowercase();
 	if (invalids->Exists(lowerext))
 	  {
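
For anyone who wants to try the logic outside of htdig, the following is
a rough standalone sketch of the same two steps as the patch above.  It
is only an illustration under some assumptions: patched_extension() is a
hypothetical name, and it uses std::string in place of ht://Dig's own
String class, so it is not a drop-in replacement for the code in
Retriever.cc.

```cpp
#include <algorithm>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper (not part of ht://Dig): return the lowercased
// extension of the final path component, or "" if it has none.
std::string patched_extension(const char *url)
{
    const char *ext = std::strrchr(url, '.');

    // Ignore a dot if it's not in the final component of the path.
    if (ext && std::strchr(ext, '/'))
        ext = nullptr;
    if (!ext)
        return std::string();

    std::string lowerext(ext);

    // Chop off the URL parameter, if there is one.
    std::string::size_type parm = lowerext.find('?');
    if (parm != std::string::npos)
        lowerext.erase(parm);

    // Lowercase the result so it can be compared with the config lists.
    std::transform(lowerext.begin(), lowerext.end(), lowerext.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return lowerext;
}
```

Here "http://192.168.0.2/news/index.php3?idnews=3" gives ".php3" and
"http://192.168.0.2/services" gives an empty string, i.e. no extension.
Note that in this sketch a dot inside the parameter itself would still
be picked up, since the last dot is located before the parameter is
chopped off.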

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

