  1. Nutch
  2. NUTCH-2541

Non-ASCII characters in the URL path are not properly escaped by the protocol-httpclient plugin

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1, 1.14
    • Fix Version/s: 1.19
    • Component/s: plugin, protocol
    • Labels:
      None

      Description

      As reported in [1]:

      When crawling URLs that contain Arabic characters, Nutch fails with an InvalidArgumentException. This happens because the HTTP client library internally uses java.net.URI, which does not support these characters unless they are properly escaped.

      [1] https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225
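
To illustrate what "properly escaped" means here, the following is a minimal sketch of percent-encoding the non-ASCII characters of a URL as UTF-8 bytes before the string reaches the HTTP client. It is not the fix applied in Nutch or in the protocol-httpclient plugin; the class name `UrlPathEscaper` and the method `escapeNonAscii` are hypothetical names chosen for this example, and the sketch assumes that UTF-8 is the encoding the target server expects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
final class UrlPathEscaper {
    private static final String HEX = "0123456789ABCDEF";

    private UrlPathEscaper() {}

    // Percent-encodes every non-ASCII character as its UTF-8 byte sequence.
    // ASCII characters, including existing %XX escapes and URL delimiters,
    // are copied through unchanged.
    public static String escapeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length());
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                // Single-byte UTF-8 sequences are plain ASCII.
                out.append((char) v);
            } else {
                // Bytes >= 0x80 belong to a multi-byte (non-ASCII) character.
                out.append('%')
                   .append(HEX.charAt(v >> 4))
                   .append(HEX.charAt(v & 0x0F));
            }
        }
        return out.toString();
    }
}
```

The result contains only ASCII characters, so it could be handed to `java.net.URI.create(...)` or to the HTTP client in place of the raw URL. The sketch deliberately leaves other illegal ASCII characters (such as spaces) untouched, since the issue concerns non-ASCII characters only.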


            People

            • Assignee: Unassigned
            • Reporter: Jorge Luis Betancourt Gonzalez (jorgelbg)
            • Votes: 0
            • Watchers: 3