Date: Tue, 30 Mar 2004 12:20:41 -0600 (CST)
From: Gilles Detillieux
To: "ht://Dig mailing list"
Subject: Re: [htdig] query parameters should be ignored by extension filter? - PATCH for 3.2.0b5

According to me:
> Last week, I wrote:
> > According to David Adams:
> > > I am also using ht://Dig version 3.1.6 and for me it IS indexing URLs like
> > >
> > > http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
> > >
> > > even though I have .jpg in my bad_extensions: list.
> >
> > Actually, I find this surprising.  Upon looking at the code that handles
> > bad_extensions, in both 3.1.6 and 3.2.0b5, it seems to me that there is
> > indeed a bug in the way htdig locates filename extensions in URLs, as
> > Toby described.  Can you confirm that you're running vanilla 3.1.6 with
> > no patches to htdig/Retriever.cc which might correct this bug?
> >
> > The fix to the code should be pretty simple, but I haven't had the time
> > to sit down and stare at it long enough to get the fix coded yet.  I'll
> > try to get around to it by Friday, so it'll be in the next development
> > snapshot for the 3.2 betas, and posted to the list.
>
> OK, last week got a bit crazy, so I wrote the patch yesterday afternoon,
> just before the end of my work day.  Here it is.  Apply it in your main
> 3.1.6 source directory using "patch -p0 < this-message-file".  Please
> let me know if it solves the problem for you and/or causes others.  I've
> made sure the code compiles with the patch, but haven't tested it beyond
> that.  Thanks.

And this is the same patch for the 3.2.0b5 version, in case anyone wants
to give it a shot.  Same story.
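
For illustration only, here is the idea behind the fix as a standalone
sketch.  It uses std::string rather than htdig's own String class, and the
helper name path_extension is made up for this example; it is not a
function in Retriever.cc.  The point is simply that the query string is
chopped off before the last dot is located, not after.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of htdig: returns the lowercased extension
// (including the leading dot) of the final path component of `url`, or an
// empty string if there is none.  Everything from the first '?' onward is
// ignored, so dots inside query parameters are never mistaken for an
// extension.
std::string path_extension(const std::string &url)
{
    // Chop off the URL parameters before looking for the extension.
    // find() returns npos when there is no '?', and substr(0, npos)
    // then yields the whole string.
    std::string urlpath = url.substr(0, url.find('?'));

    // Locate the last dot in what remains.
    std::string::size_type dot = urlpath.rfind('.');
    if (dot == std::string::npos)
        return "";

    // Ignore a dot if it's not in the final component of the path.
    if (urlpath.find('/', dot) != std::string::npos)
        return "";

    std::string ext = urlpath.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return ext;
}
```

With this logic, the gallery.php URL quoted above yields ".php" rather
than ".jpg", so it is the ".php" extension that would be compared against
the bad_extensions list.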

--- htdig/Retriever.cc.orig	2003-10-23 12:40:20.000000000 -0500
+++ htdig/Retriever.cc	2004-03-29 17:47:25.000000000 -0600
@@ -1023,16 +1023,17 @@ int Retriever::IsValidURL(const String &
     //
     // See if the file extension is in the list of invalid ones
     //
-    ext = strrchr((char *) url, '.');
+    String	urlpath = url.get();
+    int		parm = urlpath.indexOf('?');	// chop off URL parameter
+    if (parm >= 0)
+	urlpath.chop(urlpath.length() - parm);
+    ext = strrchr((char *) urlpath.get(), '.');
     String	lowerext;
     if (ext && strchr(ext, '/'))	// Ignore a dot if it's not in the
       ext = NULL;			// final component of the path.
     if (ext)
     {
 	lowerext.set(ext);
-	int		parm = lowerext.indexOf('?');	// chop off URL parameter
-	if (parm >= 0)
-	    lowerext.chop(lowerext.length() - parm);
 	lowerext.lowercase();
 	if (invalids.Exists(lowerext))
 	{

-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)