Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
The parsechecker (org.apache.nutch.parse.ParserChecker) calls URLUtil.toASCII on the URL it is given, which re-encodes URLs that are already percent-encoded.
For instance, let's say we want to query http://example.com, passing a GET parameter with name 'q' and value '/'. '/' is a special character, and thus has to be encoded before being sent.
If we pass 'http://example.com/?q=/' to the parsechecker, it does not encode the '/' and tries to fetch the URL as is, which is invalid.
If we encode the parameter beforehand and call the parsechecker with 'http://example.com/?q=%2F', it encodes the '%' sign as '%25' and thus fetches 'http://example.com/?q=%252F'.
This makes it impossible to fetch the correct URL (http://example.com/?q=%2F) from the parsechecker.
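The behaviour described above can be reproduced outside Nutch with a minimal sketch. It assumes the conversion rebuilds the URL from its components with the multi-argument java.net.URI constructor, which always quotes the '%' character; the report itself does not show how URLUtil.toASCII is implemented, so treat that as an assumption. The class name DoubleEncodingDemo and the method rebuild are hypothetical and exist only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodingDemo {

  // Hypothetical helper, not Nutch code: splits a URL into its raw components
  // and reassembles it with the multi-argument URI constructor. That
  // constructor quotes every '%' it sees, so an existing escape such as
  // "%2F" comes out as "%252F". A '/' in the query is a legal character
  // there and is left untouched.
  static String rebuild(String url) throws URISyntaxException {
    URI in = new URI(url);
    return new URI(in.getScheme(), in.getRawUserInfo(), in.getHost(),
        in.getPort(), in.getRawPath(), in.getRawQuery(),
        in.getRawFragment()).toString();
  }

  public static void main(String[] args) throws Exception {
    // The unencoded slash passes through unchanged.
    System.out.println(rebuild("http://example.com/?q=/"));
    // prints http://example.com/?q=/

    // The pre-encoded slash is encoded a second time.
    System.out.println(rebuild("http://example.com/?q=%2F"));
    // prints http://example.com/?q=%252F

    // For comparison, the single-argument constructor keeps existing escapes.
    System.out.println(new URI("http://example.com/?q=%2F").toString());
    // prints http://example.com/?q=%2F
  }
}
```

Neither input produces 'http://example.com/?q=%2F', which matches the two cases in the description.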
Issue Links
- is part of: NUTCH-2012 Merge parsechecker and indexchecker (Closed)