RFC 1808 specified that path elements of URLs may contain so-called params introduced by ";", e.g. ";type=a". If the base URL contains a path param while the link target does not, the params are transferred to the target:
a) if the embedded URL's <params> is non-empty, we skip to
step 7; otherwise, it inherits the <params> of the base URL (if any)
This behaviour was implemented with NUTCH-436. Later (NUTCH-1115) it was made optional and configurable via the property parser.fix.embeddedparams. NUTCH-797 then made the changes of both issues inactive for 1.x (not applied to 2.x), with reference to RFC 3986.
Historically, each segment was specified to contain parameters separated from it using a semicolon (";"), though this was rarely used in practice and current specifications allow but no longer specify such semantics.
Accordingly, any special treatment of "params" in relative links should be removed from Nutch. At first glance, this would include:
- 2.x parse-html and parse-tika
- remove fixEmbeddedParams(...)
- change unit tests to follow examples from RFC 3986
- remove unused fixEmbeddedParams(...) from parse-html
- remove property parser.fix.embeddedparams from nutch-default.xml
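
For the unit tests, a minimal sketch of what "follow examples from RFC 3986" could look like is given below. It uses the base URI and a few of the reference-resolution examples from RFC 3986, section 5.4, and resolves them with the JDK's java.net.URI. The class name RelativeParamsSketch and the helper resolve(...) are hypothetical and not part of Nutch. Note that java.net.URI is documented against RFC 2396 rather than RFC 3986; the sketch assumes only that the two agree for these particular cases, where ";p" is treated as an ordinary part of the last path segment and is not inherited by the target.

```java
import java.net.URI;

public class RelativeParamsSketch {

    // Base URI used by the reference-resolution examples in RFC 3986, section 5.4.
    static final URI BASE = URI.create("http://a/b/c/d;p?q");

    // Hypothetical helper: resolve a relative reference against BASE
    // using the JDK's generic resolver, without any special "params" handling.
    static String resolve(String relative) {
        return BASE.resolve(relative).toString();
    }

    public static void main(String[] args) {
        // The base's ";p" is not carried over to the resolved target.
        System.out.println(resolve("g"));    // http://a/b/c/g
        // A reference starting with ";" replaces the whole last segment.
        System.out.println(resolve(";x"));   // http://a/b/c/;x
        // Params on the target itself are kept verbatim.
        System.out.println(resolve("g;x"));  // http://a/b/c/g;x
    }
}
```

Under RFC 1808, by contrast, the reference ";x" was listed as resolving to http://a/b/c/d;x, which is the kind of difference the updated tests would need to cover.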