  1. Nutch
  2. NUTCH-3039

Failure to handle ftp:// URLs

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.19
    • Fix Version/s: 1.21
    • Component/s: plugin, protocol
    • Labels: None
    • Patch Available

    Description

      Nutch fails to handle ftp:// URLs:

      • URLNormalizerBasic returns the empty string because creating the URL instance fails with a MalformedURLException:
        echo "ftp://ftp.example.com/path/file.txt" \
          | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic
      • fetching an ftp:// URL with the protocol-ftp plugin enabled also fails with a MalformedURLException:
        bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
           "ftp://ftp.example.com/path/file.txt"
        ...
        Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
                at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
        ...

      The issue is caused by NUTCH-2429:

      • we do not provide a dedicated URL stream handler for ftp URLs
      • but also do not pass ftp:// URLs to the standard JVM handler
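
The second point relies on standard java.net.URL behavior: when an installed URLStreamHandlerFactory returns null for a protocol, the JVM falls back to its built-in handler, and a built-in handler for ftp exists. The following is a minimal sketch of that fallback, not Nutch's actual code. The class names FtpUrlFallbackSketch and DelegatingFactory, the JVM_PROTOCOLS set, and the stub handler are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

public class FtpUrlFallbackSketch {

  // Hypothetical list of protocols that are left to the JVM's built-in handlers.
  static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar", "ftp");

  /** Hypothetical factory for illustration only; not the Nutch implementation. */
  public static class DelegatingFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      if (JVM_PROTOCOLS.contains(protocol)) {
        // Returning null makes java.net.URL fall back to the built-in handler.
        return null;
      }
      // Stub for every other protocol: URLs can be parsed but not opened.
      return new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
          throw new IOException("No handler available for " + u.getProtocol());
        }
      };
    }
  }

  /** Parses the URL string and returns "protocol host path". */
  public static String describe(String urlString) {
    try {
      URL url = URI.create(urlString).toURL();
      return url.getProtocol() + " " + url.getHost() + " " + url.getPath();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Cannot parse " + urlString, e);
    }
  }

  public static void main(String[] args) {
    // The factory can be installed only once per JVM.
    URL.setURLStreamHandlerFactory(new DelegatingFactory());
    System.out.println(describe("ftp://ftp.example.com/path/file.txt"));
  }
}
```

With "ftp" in the pass-through set, creating the URL instance succeeds. The actual fix may instead ship a dedicated ftp handler with the protocol-ftp plugin; this sketch only covers the fallback path.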

            People

              snagel Sebastian Nagel