Nutch / NUTCH-2451

protocol-ftp to resolve relative URL when following redirects

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.4, 1.14
    • Component/s: protocol
    • Labels: None
    • Environment: Ubuntu 16.04.3 LTS
      OpenJDK 1.8.0_131
      nutch 1.14-SNAPSHOT
      Synology RS816

    Description

      I tried running Nutch against my Synology NAS. Since Nutch does not include an SMB protocol plugin, I enabled the FTP service on the NAS and configured Nutch to crawl ftp://nas.
      The results are inconsistent, which seems to point to problems within Nutch, although this may need further evaluation.

      Some files could not be downloaded, and no useful error message was logged. I therefore changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) so that it not only returns the protocol status but also writes the full exception and stack trace to the logs:

      {{
      } catch (Exception e) {
        LOG.warn("Could not get {}", url, e);
        return new ProtocolOutput(null, new ProtocolStatus(e));
      }
      }}
      With this modification, messages like the following appear in the log file:
      {{
      2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
      2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
      java.net.MalformedURLException
          at java.net.URL.<init>(URL.java:627)
          at java.net.URL.<init>(URL.java:490)
          at java.net.URL.<init>(URL.java:439)
          at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
          at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.lang.NullPointerException
      }}
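
      The shape of this trace (a MalformedURLException without a message, wrapping a NullPointerException, raised inside the java.net.URL constructors) is what the JDK produces when the string passed to new URL(String) is null. The sketch below reproduces that shape outside Nutch. The class NullSpecDemo and the method tryParse are hypothetical names, and it is only an assumption that a null string is what actually reaches Ftp.java:145.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NullSpecDemo {

  /**
   * Tries to build a URL from the given string.
   * Returns the MalformedURLException that was thrown, or null on success.
   */
  @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
  static MalformedURLException tryParse(String spec) {
    try {
      new URL(spec);
      return null;
    } catch (MalformedURLException e) {
      return e;
    }
  }

  public static void main(String[] args) {
    // Hypothetical input: a URL string that was never set.
    MalformedURLException e = tryParse(null);
    System.out.println("exception: " + e);
    System.out.println("cause:     " + (e == null ? null : e.getCause()));

    // The URL from the log parses without any exception.
    System.out.println("valid URL failed: "
        + (tryParse("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so") != null));
  }
}
```

      If this assumption holds, the URL printed in the log is not the one being rejected; the exception would come from a second URL that is built later from a missing value.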

      Please note that I did not configure this URL myself; Nutch obtained it by crawling my NAS. The URL also looks perfectly fine to me. Even if the file did not exist, I would not expect a MalformedURLException. Moreover, Firefox retrieves the file successfully when given the same URL and the same credentials.

      How come Nutch cannot get the file?
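
      The issue title points at redirects: a relative redirect target has to be resolved against the URL being fetched before it can be used as a URL of its own. The following is a minimal sketch of that resolution with java.net.URL, not the actual Nutch change. The class RedirectResolveDemo, the method resolveRedirect, and the example paths are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectResolveDemo {

  /**
   * Resolves a possibly relative redirect target against the URL that was
   * being fetched and returns the absolute URL as a string.
   * Throws IllegalArgumentException if the target is missing or unusable.
   */
  @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
  static String resolveRedirect(String baseUrl, String target) {
    if (target == null) {
      // Fail with a readable message instead of a bare MalformedURLException.
      throw new IllegalArgumentException("redirect target missing for " + baseUrl);
    }
    try {
      URL base = new URL(baseUrl);
      // The two-argument constructor resolves target relative to base;
      // an absolute target is returned unchanged.
      return new URL(base, target).toExternalForm();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(
          "cannot resolve " + target + " against " + baseUrl, e);
    }
  }

  public static void main(String[] args) {
    String base = "ftp://nas/MediaPC/usr/lib32/gconv/link.so"; // hypothetical path
    System.out.println(resolveRedirect(base, "ARMSCII-8.so"));
    System.out.println(resolveRedirect(base, "/MediaPC/other/file.so"));
  }
}
```

      Resolving against the base keeps the scheme and host of the original request, so a target given only as a file name or a server-absolute path still yields a complete ftp:// URL.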

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Hiran Chaudhuri (hiranchaudhuri)