Nutch / NUTCH-1767

remove special treatment of "params" in relative links

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      RFC 1808 specified that path elements of URLs may contain so-called params introduced by ";", e.g. ";type=a". If the base URL contains a path param while the link target does not, the params are transferred to the target:

      Step 5:
      a) if the embedded URL's <params> is non-empty, we skip to
      step 7; otherwise, it inherits the <params> of the base URL (if any)

      This behaviour was implemented with NUTCH-436. Later (NUTCH-1115) it was made optional and configurable via the property parser.fix.embeddedparams. NUTCH-797 made the changes of both issues inactive for 1.x (not applied to 2.x), with reference to RFC 3986.

      RFC 3986, which obsoletes RFC 1808, does not mention params, and the examples given in section 5.4, "Reference Resolution Examples", contradict RFC 1808. Wikipedia also states:

      Historically, each segment was specified to contain parameters separated from it using a semicolon (";"), though this was rarely used in practice and current specifications allow but no longer specify such semantics.
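
      The difference can be illustrated with a minimal sketch. The class name ParamResolutionDemo and the helper resolve(...) are hypothetical and not part of Nutch. The sketch uses java.net.URI, which is documented against RFC 2396 rather than RFC 3986, so it is only an approximation; for the references shown here its results should match the examples in RFC 3986, section 5.4.

```java
import java.net.URI;

public class ParamResolutionDemo {

    // Hypothetical helper: resolve a relative reference against a base URL
    // with java.net.URI, without any special handling of ";params".
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Base URL as used in the examples of RFC 3986, section 5.4.
        String base = "http://a/b/c/d;p?q";

        // RFC 3986 lists "http://a/b/c/;x" for this reference, whereas
        // RFC 1808 lists "http://a/b/c/d;x" in its own examples.
        System.out.println(resolve(base, ";x"));

        // The params ";p" of the base URL are not carried over to the target.
        System.out.println(resolve(base, "g"));
        System.out.println(resolve(base, "g;x"));
    }
}
```

      Under these assumptions, a target such as ";x" replaces the last path segment of the base instead of being attached to it, which is the behaviour that remains once the special treatment of params is dropped.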

      Accordingly, any special treatment of "params" in relative links should be removed from Nutch. At first glance, this would include:

      • 2.x parse-html and parse-tika
        • remove fixEmbeddedParams(...)
        • change unit tests to follow examples from RFC 3986
      • 1.x
        • remove unused fixEmbeddedParams(...) from parse-html
        • remove property parser.fix.embeddedparams from nutch-default.xml

        Attachments

        1. test_nutch_1767-1.html
          0.2 kB
          Sebastian Nagel
        2. test_nutch_1767-2.html
          0.3 kB
          Sebastian Nagel
        3. NUTCH-1767-1x.patch
          4 kB
          Sebastian Nagel
        4. NUTCH-1767-2x.patch
          9 kB
          Sebastian Nagel

            People

            • Assignee: Unassigned
            • Reporter: Sebastian Nagel (snagel)