Description
Nutch fails to handle ftp:// URLs:
- URLNormalizerBasic returns the empty string because creating the URL instance fails with a MalformedURLException:
    echo "ftp://ftp.example.com/path/file.txt" \
      | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic
- fetching an ftp:// URL with the protocol-ftp plugin enabled also fails with a MalformedURLException:
    bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
      "ftp://ftp.example.com/path/file.txt"
    ...
    Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
      at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
    ...
The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
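To illustrate the second point, here is a minimal sketch of how a custom URLStreamHandlerFactory can hand a protocol back to the JVM: when createURLStreamHandler returns null, java.net.URL falls back to its built-in handlers. The class name DelegatingHandlerFactory and the JVM_PROTOCOLS set are hypothetical and only serve the example. This is not the actual Nutch code or the patch for this issue.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Set;

/** Hypothetical factory for illustration only, not Nutch's implementation. */
class DelegatingHandlerFactory implements URLStreamHandlerFactory {

  // Assumption: protocols we want the standard JVM handlers to take care of.
  private static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar", "ftp");

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol.toLowerCase(Locale.ROOT))) {
      // Returning null tells java.net.URL to use its built-in handler.
      return null;
    }
    // For every other protocol, return a stub handler so that URL
    // instances can be created; opening a connection is not supported.
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException(
            "No connection support for protocol " + u.getProtocol());
      }
    };
  }
}
```

Such a factory would be registered with URL.setURLStreamHandlerFactory(new DelegatingHandlerFactory()), which the JVM accepts only once per process. If "ftp" is missing from the delegated set and no dedicated handler is returned for it, creating an ftp:// URL can fail in the way described above.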