Description
If a URL contains one of the characters |"<>^` or a single % (not followed by a two-character hex value), BasicURLNormalizer fails to normalize the URL path (here: it does not remove /c/..):
% for c in "" $(echo '|%"^<>`' | grep -o .); do
    echo "http://www.example.com/a/c/../b/search?q=foobar$c"
  done \
  | nutch normalizerchecker -normalizer urlnormalizer-basic -stdin
Checking combination of these URLNormalizers: BasicURLNormalizer
http://www.example.com/a/b/search?q=foobar
http://www.example.com/a/c/../b/search?q=foobar|
http://www.example.com/a/c/../b/search?q=foobar%
http://www.example.com/a/c/../b/search?q=foobar"
http://www.example.com/a/c/../b/search?q=foobar^
http://www.example.com/a/c/../b/search?q=foobar<
http://www.example.com/a/c/../b/search?q=foobar>
http://www.example.com/a/c/../b/search?q=foobar`
The reason is that these characters (the list should be checked for more, including control characters) are not valid as part of a URI (cf. RFC 3986). BasicURLNormalizer normalizes the path by converting the URL to a URI and calling normalize(). For a URL containing one of these characters the conversion fails, and the path is left as it is.
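The behaviour of java.net.URI can be reproduced outside of Nutch. The following is a minimal sketch, not the actual BasicURLNormalizer code: the class UriNormalizeDemo and the method normalizeViaUri are hypothetical names, and returning the input unchanged on failure is an assumption chosen to mirror the output shown above.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriNormalizeDemo {

  /**
   * Hypothetical helper: normalize via java.net.URI. If the string is not
   * a valid URI, return it unchanged (assumed fallback, for illustration).
   */
  public static String normalizeViaUri(String url) {
    try {
      return new URI(url).normalize().toString();
    } catch (URISyntaxException e) {
      // Thrown for illegal characters such as | " < > ^ ` and for a
      // '%' that is not followed by two hex digits.
      return url;
    }
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/a/c/../b/search?q=foobar";
    String[] suffixes = {"", "|", "%", "\"", "^", "<", ">", "`"};
    for (String suffix : suffixes) {
      System.out.println(normalizeViaUri(base + suffix));
    }
  }
}
```

Only the first URL (empty suffix) passes the URI constructor and gets /c/.. removed; for every other suffix the constructor throws URISyntaxException.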
There are two possible solutions:
- do not use java.net.URI
- ensure that every URL returned (or used internally) by urlnormalizer-basic is a valid URI (or rather the String representation of one).
I would opt for the second option because the class URI is used practically everywhere in Nutch and in libraries (e.g. HttpClient). Any thoughts?
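One way the second option could look is to percent-encode the offending characters before the URL is handed to java.net.URI. This is only a sketch under assumptions: UriSafeEscaper, escapeInvalid and normalize are hypothetical names, the set of escaped characters (ASCII control characters, space, DEL, and |"<>^`{}\) is a guess that would still need to be checked against RFC 3986, and non-ASCII characters are deliberately left untouched.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriSafeEscaper {

  private static final char[] HEX = "0123456789ABCDEF".toCharArray();

  private static boolean isHex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
  }

  /** True if the ASCII character at position i must be percent-encoded. */
  private static boolean needsEscape(String s, int i) {
    char c = s.charAt(i);
    if (c == '%') {
      // A '%' is fine only when it starts a valid escape such as %2F.
      return !(i + 2 < s.length() && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2)));
    }
    // Assumed set: controls, space, DEL, and a few printable characters.
    return c <= 0x20 || c == 0x7f || "|\"<>^`{}\\".indexOf(c) >= 0;
  }

  /** Percent-encode characters that would make java.net.URI reject the string. */
  public static String escapeInvalid(String url) {
    StringBuilder sb = new StringBuilder(url.length() + 8);
    for (int i = 0; i < url.length(); i++) {
      char c = url.charAt(i);
      if (needsEscape(url, i)) {
        // All escaped characters are below 0x80, so one byte suffices.
        sb.append('%').append(HEX[c >> 4]).append(HEX[c & 0xf]);
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Escape first, then normalize; fall back to the input if URI still rejects it. */
  public static String normalize(String url) {
    try {
      return new URI(escapeInvalid(url)).normalize().toString();
    } catch (URISyntaxException e) {
      return url;
    }
  }

  public static void main(String[] args) {
    String base = "http://www.example.com/a/c/../b/search?q=foobar";
    for (String suffix : new String[] {"", "|", "%", "\"", "^", "<", ">", "`"}) {
      System.out.println(normalize(base + suffix));
    }
  }
}
```

With this approach the URL ending in | would come out as http://www.example.com/a/b/search?q=foobar%7C, i.e. the path is normalized and the returned string is itself a valid URI. Whether changing the URL string in this way is acceptable for all callers is a separate question.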