Nutch / NUTCH-2547

urlnormalizer-basic fails on special characters in path/query

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: plugin
    • Labels: None

    Description

      If a URL contains one of the characters |"<>^` or a single % (not followed by a two-character hex value), BasicURLNormalizer fails to normalize the URL path (here: it does not remove /c/..):

      % for c in "" $(echo '|%"^<>`' | grep -o .); do
          echo "http://www.example.com/a/c/../b/search?q=foobar$c"
        done \
        | nutch normalizerchecker -normalizer urlnormalizer-basic -stdin
      Checking combination of these URLNormalizers: BasicURLNormalizer 
      http://www.example.com/a/b/search?q=foobar
      http://www.example.com/a/c/../b/search?q=foobar|
      http://www.example.com/a/c/../b/search?q=foobar%
      http://www.example.com/a/c/../b/search?q=foobar"
      http://www.example.com/a/c/../b/search?q=foobar^
      http://www.example.com/a/c/../b/search?q=foobar<
      http://www.example.com/a/c/../b/search?q=foobar>
      http://www.example.com/a/c/../b/search?q=foobar`
      

      The reason is that these characters are not valid as part of a URI (cf. RFC 3986); the list is probably incomplete and should be checked for more, including control characters. BasicURLNormalizer normalizes the path by converting the URL to a java.net.URI and calling normalize(), so when that conversion fails the path is left as is.
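      The behaviour of java.net.URI can be reproduced outside Nutch. The following is a minimal sketch, not code from BasicURLNormalizer: the class name UriNormalizeDemo and the helper normalizeOrNull are made up for illustration. It only shows that the URI constructor rejects the characters listed above, while normalize() removes /c/.. from the clean URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of Nutch.
final class UriNormalizeDemo {

    /** Returns the normalized URI string, or null if java.net.URI rejects the input. */
    static String normalizeOrNull(String url) {
        try {
            // The single-argument constructor parses strictly and throws
            // URISyntaxException on characters that are illegal in a URI.
            return new URI(url).normalize().toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/a/c/../b/search?q=foobar";
        // Same suffixes as in the shell loop of the report.
        String[] suffixes = {"", "|", "%", "\"", "^", "<", ">", "`"};
        for (String suffix : suffixes) {
            String url = base + suffix;
            String normalized = normalizeOrNull(url);
            System.out.println(url + " -> "
                + (normalized == null ? "(rejected by java.net.URI)" : normalized));
        }
    }
}
```

      Only the first URL (empty suffix) should come back normalized; the others are rejected before normalize() is ever reached.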

      There are two possible solutions:

      1. do not use java.net.URI
      2. ensure that every URL returned (or used internally) by urlnormalizer-basic is a valid URI (or, more precisely, the String representation of one).

      I would opt for #2 because the class URI is used practically everywhere in Nutch and libraries (e.g. HttpClient). Any thoughts?
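      One way option #2 could look is to percent-encode everything that is not legal in a URI before handing the string to java.net.URI. This is a rough sketch under stated assumptions, not the fix that was committed: EscapeBeforeNormalize, escapeIllegal and normalize are hypothetical names, the allowed character set is a simplification of RFC 3986, and square brackets are always escaped, so IPv6 host literals are not handled.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of option #2, not the actual Nutch implementation.
final class EscapeBeforeNormalize {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private static boolean isHex(int c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Simplified whitelist: ASCII letters, digits, and URI punctuation except '[' and ']'. */
    private static boolean isAllowed(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
            || "-._~:/?#@!$&'()*+,;=".indexOf(c) >= 0;
    }

    /** Percent-encodes every byte that is not allowed; keeps valid %XX escapes untouched. */
    static String escapeIllegal(String url) {
        byte[] bytes = url.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length + 8);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b == '%') {
                // A '%' is only legal when followed by two hex digits.
                boolean validEscape = i + 2 < bytes.length
                    && isHex(bytes[i + 1] & 0xFF) && isHex(bytes[i + 2] & 0xFF);
                sb.append(validEscape ? "%" : "%25");
            } else if (isAllowed(b)) {
                sb.append((char) b);
            } else {
                sb.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return sb.toString();
    }

    /** Escapes first, then lets java.net.URI resolve "." and ".." path segments. */
    static String normalize(String url) {
        return URI.create(escapeIllegal(url)).normalize().toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/a/c/../b/search?q=foobar";
        for (String suffix : new String[] {"", "|", "%", "\"", "^", "<", ">", "`"}) {
            System.out.println(normalize(base + suffix));
        }
    }
}
```

      With this approach the returned URL differs from the input in its escaping (for example | becomes %7C), which is the price of guaranteeing that the result can always be parsed as a URI.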

            People

              snagel Sebastian Nagel
              snagel Sebastian Nagel