  1. Nutch
  2. NUTCH-2459

Nutch cannot download/parse some files via FTP

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: 1.19
    • Component/s: protocol
    • Labels:
      None
    • Environment:

      Ubuntu 16.04.3 LTS
      OpenJDK 1.8.0_131
      nutch 1.14-SNAPSHOT
      Synology RS816

      Description

      I tried running Nutch on my Synology NAS. Because Nutch does not include support for the SMB protocol, I enabled the FTP service on the NAS and configured Nutch to crawl ftp://nas.
      The results vary from file to file, which seems to point to problems within Nutch. However, this may need further evaluation.

      Some files could not be downloaded, and I could not find a useful error message, so I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to log the full exception and stack trace in addition to returning the protocol status:

      {{
      } catch (Exception e) {
        // Log the full exception with its stack trace, not only the status
        LOG.warn("Could not get {}", url, e);
        return new ProtocolOutput(null, new ProtocolStatus(e));
      }
      }}
      With this modification I now see messages like the following in the log file:
      {{2017-11-09 23:44:56,135 WARN org.apache.nutch.protocol.ftp.Ftp - Error:
      java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
      at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
      at java.util.LinkedList.get(LinkedList.java:476)
      at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:267)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      2017-11-09 23:44:56,135 ERROR org.apache.nutch.protocol.ftp.Ftp - Could not get protocol output for ftp://nas/MediaPC/boot/memtest86+.elf
      org.apache.nutch.protocol.ftp.FtpException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:309)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:133)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
      at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
      at java.util.LinkedList.get(LinkedList.java:476)
      at org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse(FtpResponse.java:327)
      }}

      I cannot tell what the URLs showing this problem have in common. They seem to be regular files, yet many other regular files are fetched and parsed successfully. As far as I understand the source code, at least one outgoing link is expected:
      {{
      FTPFile ftpFile = (FTPFile) list.get(0);
      }}
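
The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming the listing is a `java.util.LinkedList` (as the stack trace suggests) and that it came back empty. The class and method names are hypothetical and not part of Nutch, and plain strings stand in for `FTPFile` entries.

```java
import java.util.LinkedList;
import java.util.List;

class EmptyListingDemo {
    // Hypothetical helper: mimics the unguarded list.get(0) from the report.
    static String firstEntryUnchecked(List<String> listing) {
        return listing.get(0); // throws IndexOutOfBoundsException if the listing is empty
    }

    public static void main(String[] args) {
        // Assumption: the server returned no entries for the requested path.
        List<String> listing = new LinkedList<>();
        try {
            firstEntryUnchecked(listing);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Unguarded access failed: " + e.getMessage());
        }
    }
}
```

This only shows that an empty list is sufficient to trigger the exception; it does not explain why the listing would be empty for these particular files.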

      Can this be safely assumed for all files? Or should there be a check whether any outgoing links were found?
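
One possible shape for such a check is sketched below. It is an illustration only, not the Nutch code or a proposed patch: `ListingGuard` and `firstEntry` are hypothetical names, and the generic element type stands in for `FTPFile`.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Optional;

class ListingGuard {
    // Hypothetical helper: returns the first listing entry, or empty if there is none.
    static <T> Optional<T> firstEntry(List<T> listing) {
        if (listing == null || listing.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(listing.get(0));
    }

    public static void main(String[] args) {
        // Assumption: an empty listing, as in the failing case above.
        List<String> listing = new LinkedList<>();
        String result = firstEntry(listing).orElse("<no listing entry>");
        System.out.println(result);
    }
}
```

How the caller should react to the empty case (for example, by reporting the file as not found or by retrying the listing) is a separate design decision that this sketch leaves open.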

            People

            • Assignee:
              Unassigned
              Reporter:
              hiran_chaudhuri Hiran Chaudhuri