
From grdetil@scrc.umanitoba.ca Thu Oct 14 14:46:54 1999
Date: Thu, 14 Oct 1999 15:52:18 -0500 (CDT)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: htdig@htdig.org
Cc: htdig@htdig.org
Subject: Re: [htdig] URLs parsing problem


According to Benjelloun Adnane:
> When I execute "rundig" it doesn't parse Hyperlinks correctly
> 
> This URL :
> index.epl?menu=menug&ances=root_1000&selec=1000&href=file1.html
> 
> Is parsed as :
> 
> index.epl?menu=menug=root_1000=1000=file1.htm
> 
> 
> Please msg. me at (abenjell@crim.ca) if you have any help

Yes, this is a known problem with the 3.1.3 release.  It turns out that
the fix for handling &foo; SGML entities in HTML tag parameters had a
couple of problems.  First, it didn't handle &amp; if translate_amp
was false, even though this was one of the main motivations for the fix.
Second, it mangled bare &'s in tag parameters, such as the ones that
separate the query parameters in your URL.

Torsten posted a patch for this a few days after 3.1.3 was released.
I'm now posting what I think is an improvement on it.  Torsten's
patch had a couple of potential problems which I think this patch avoids.
(The most serious of these was that a parameter after a bare "&" could
still get stripped out if translate_amp was true.)  Please give this
patch a try and let me know if there are any problems.

--- htdig/HTML.cc.orig	Wed Sep 22 11:18:40 1999
+++ htdig/HTML.cc	Thu Oct 14 15:08:24 1999
@@ -1114,7 +1114,15 @@ HTML::transSGML(char *str)
     while (*text)
     {
 	if (*text == '&')
-	    convert << SGMLEntities::translateAndUpdate(text);
+	{
+	    if (strncmp((char *)text, "&amp;", 5) == 0) 
+	    {
+		// We MUST convert these in URLs, regardless of translate_amp.
+		convert << '&';
+		text += 5;
+	    } else
+		convert << SGMLEntities::translateAndUpdate(text);
+	}
 	else
 	    convert << *text++;
     }
--- htdig/SGMLEntities.cc.orig	Wed Sep 22 11:18:41 1999
+++ htdig/SGMLEntities.cc	Thu Oct 14 15:08:31 1999
@@ -280,5 +280,11 @@ SGMLEntities::translateAndUpdate(unsigne
     
     if (*entityStart == ';')
 	entityStart++;		// A final ';' is used up.
-    return translate(entity);
+    unsigned char e = translate(entity);
+    if (e == ' ' && strncmp((char *)orig, "&#32", 4) != 0)
+    {
+	entityStart = orig + 1;	// Catch unrecognized entities...
+	return '&';
+    }
+    return e;
 }
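
For anyone who wants to see the intended behaviour without building
htdig, here is a minimal standalone sketch of the rule the patch above
is aiming for: a recognized entity is translated, and an unrecognized
"&name" is left alone as a bare "&" with the text after it kept.  This
is not the htdig code.  The function names and the tiny entity table
are hypothetical stand-ins for SGMLEntities, and numeric entities such
as &#32; are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the SGMLEntities lookup table; the real
// table in htdig is much larger.
static const std::map<std::string, char>& knownEntities()
{
    static const std::map<std::string, char> table = {
        {"amp", '&'}, {"lt", '<'}, {"gt", '>'}, {"quot", '"'},
    };
    return table;
}

// Decode entities in a tag parameter (e.g. an href value).
// Recognized entities are translated; anything else after '&' is
// treated as a bare ampersand and copied through unchanged.
std::string decodeTagParameter(const std::string& in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        if (in[i] != '&')
        {
            out += in[i++];
            continue;
        }
        // Scan the candidate entity name that follows the '&'.
        std::size_t j = i + 1;
        while (j < in.size() &&
               std::isalnum(static_cast<unsigned char>(in[j])))
            ++j;
        auto hit = knownEntities().find(in.substr(i + 1, j - i - 1));
        if (hit != knownEntities().end())
        {
            out += hit->second;
            // A final ';' is used up, as in the patched code.
            i = (j < in.size() && in[j] == ';') ? j + 1 : j;
        }
        else
        {
            out += '&';  // Unrecognized: keep the bare '&' ...
            i += 1;      // ... and resume right after it.
        }
    }
    return out;
}

// Usage: the URL from the original report should come through intact.
void checkReportedUrl()
{
    const std::string url =
        "index.epl?menu=menug&ances=root_1000&selec=1000&href=file1.html";
    assert(decodeTagParameter(url) == url);
}
```

Note that a query parameter whose name happens to match an entity name
(say "&lt=5") would still be translated by this sketch, so writing
&amp; in the HTML source remains the safe way to put an ampersand in
a URL.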


-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

