Affects Version/s: 0.8, 0.8.1, 0.9.0
Fix Version/s: None
Environment: Mac OS X and Linux (CentOS 4.5)
Patch Info: Patch Available
I'm using 0.8.1, but this affects all other versions as well.
Relative links of the form "?blah" (a query string with no path) are resolved incorrectly. For example, given the base URL http://www.fleurie.org/entreprise.asp and the relative link "?id_entrep=111", Nutch resolves the pair to "http://www.fleurie.org/?id_entrep=111", dropping "entreprise.asp" from the path. No such URL exists, and every browser I tried resolves the pair to "http://www.fleurie.org/entreprise.asp?id_entrep=111".
I tracked this down to what could be called a bug in Sun's URL class. According to Sun's documentation, the class parses relative URLs according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with "?". Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808 and stuck with it, both for compatibility and because the behavior makes more sense. Apparently even the people who wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do.
For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
Sun's URL implementation, however, still implements RFC 2396, as far as I can tell, and is out of step with the rest of the world.
This breaks link extraction on a number of sites.
I put the fix in the org.apache.nutch.net directory, but obviously feel free to move it if you feel it belongs somewhere else!
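
For illustration only, here is a minimal sketch of one way to work around the behavior described above. It is not the attached patch, and the class and method names (QueryLinkResolver, resolve) are made up for this example. The idea is to catch a target that starts with "?" before it reaches java.net.URL, and to prepend the last path segment of the base URL so that the constructor sees an ordinary relative path plus query. It assumes the base's last path segment contains no ":" or other characters that would change how the combined string is parsed.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for illustration; not part of Nutch or of the attached patch. */
final class QueryLinkResolver {

    private QueryLinkResolver() {}

    /**
     * Resolves target against base. A target that starts with "?" keeps the
     * base's last path segment, as RFC 1808, RFC 3986 and browsers do.
     * Every other target is passed to java.net.URL unchanged.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (!target.startsWith("?")) {
            return new URL(base, target);
        }
        // "/entreprise.asp" -> "entreprise.asp"; "/" or "" -> ""
        String basePath = base.getPath();
        String lastSegment = basePath.substring(basePath.lastIndexOf('/') + 1);
        // "?id_entrep=111" becomes "entreprise.asp?id_entrep=111"
        return new URL(base, lastSegment + target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.fleurie.org/entreprise.asp");
        // What java.net.URL does on its own (the behavior reported above)
        System.out.println(new URL(base, "?id_entrep=111"));
        // With the workaround: http://www.fleurie.org/entreprise.asp?id_entrep=111
        System.out.println(resolve(base, "?id_entrep=111"));
    }
}
```

Rewriting the target rather than re-implementing relative resolution keeps every other case (absolute URLs, "../" paths, fragments) on the JDK's existing code path, so only pure-query links change behavior.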