Nutch / NUTCH-2453

FTP protocol seems to have issues running multithreaded

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: 1.19
    • Component/s: protocol
    • Labels: None
    • Environment:

      Ubuntu 16.04.3 LTS
      OpenJDK 1.8.0_131
      nutch 1.14-SNAPSHOT
      Synology RS816

      Description

      I tried running Nutch on my Synology NAS. Because Nutch does not include an SMB protocol plugin, I enabled the FTP service on the NAS and configured Nutch to crawl ftp://nas. To increase crawl speed, I also set fetcher.threads.per.queue=10 in nutch-site.xml.
      Some files could not be downloaded, and the logs showed no useful error message, so I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to log the full exception and stack trace in addition to returning the protocol status:
      {{
      } catch (Exception e) {
        LOG.warn("Could not get {}", url, e);
        return new ProtocolOutput(null, new ProtocolStatus(e));
      }
      }}

      With this setup, the logs contained messages like these:
      {{2017-10-25 22:52:54,699 WARN org.apache.nutch.protocol.ftp.Ftp - ftp.client.login() failed: nas/192.168.178.43
      2017-10-25 22:52:54,718 WARN org.apache.nutch.protocol.ftp.Ftp - Error:
      java.net.SocketException: Socket closed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      at java.io.InputStreamReader.read(InputStreamReader.java:184)
      at java.io.BufferedReader.fill(BufferedReader.java:161)
      at java.io.BufferedReader.read(BufferedReader.java:182)
      at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
      at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
      at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      2017-10-25 22:52:54,721 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/Desktop/Segelclub.txt~
      org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.net.SocketException: Socket closed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      at java.io.InputStreamReader.read(InputStreamReader.java:184)
      at java.io.BufferedReader.fill(BufferedReader.java:161)
      at java.io.BufferedReader.read(BufferedReader.java:182)
      at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
      at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
      at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
      ... 2 more

      2017-10-25 22:52:54,730 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/svn/glib-2.2.3/tests/cxx-test.C
      org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.net.SocketException: Socket closed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
      at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
      at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
      at java.io.InputStreamReader.read(InputStreamReader.java:184)
      at java.io.BufferedReader.fill(BufferedReader.java:161)
      at java.io.BufferedReader.read(BufferedReader.java:182)
      at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
      at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
      at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
      at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
      at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
      ... 2 more

      2017-10-25 22:52:54,734 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/include/asm-generic/shmparam.h
      org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket is not connected
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.net.SocketException: Socket is not connected
      at java.net.Socket.getInputStream(Socket.java:905)
      at org.apache.commons.net.SocketClient.connectAction(SocketClient.java:143)
      at org.apache.commons.net.ftp.FTP.connectAction(FTP.java:374)
      at org.apache.commons.net.SocketClient.connect(SocketClient.java:172)
      at org.apache.commons.net.SocketClient.connect(SocketClient.java:266)
      at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:175)
      ... 2 more

      2017-10-25 22:52:54,744 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/home/hiran/.compiz/
      org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 500
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      }}

      Please note that all these problems vanished when I configured fetcher.threads.per.queue back to 1 (the default value).
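
      The report does not establish a root cause. One hypothesis consistent with the symptoms is that several fetcher threads share a single stateful FTP connection, so one thread disconnects while another is still logging in or reading. The sketch below illustrates that failure pattern and one possible remedy (one client per thread). FakeFtpClient and PerThreadClientSketch are hypothetical stand-ins written for this illustration; they are not Nutch or commons-net classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class PerThreadClientSketch {

  /** Hypothetical stand-in for a stateful FTP connection; no synchronization. */
  static class FakeFtpClient {
    private boolean open = false;

    void connect() { open = true; }

    void retrieve() {
      // Fails if some other thread closed the connection in the meantime.
      if (!open) {
        throw new IllegalStateException("Socket closed");
      }
    }

    void disconnect() { open = false; }
  }

  /** Runs connect/retrieve/disconnect cycles and returns the number of failed fetches. */
  static int run(int threads, int fetches, Supplier<FakeFtpClient> clientFor) {
    AtomicInteger failures = new AtomicInteger();
    List<Thread> workers = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      Thread worker = new Thread(() -> {
        for (int i = 0; i < fetches; i++) {
          FakeFtpClient client = clientFor.get();
          try {
            client.connect();
            client.retrieve();
          } catch (IllegalStateException e) {
            failures.incrementAndGet();
          } finally {
            client.disconnect();
          }
        }
      });
      workers.add(worker);
      worker.start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return failures.get();
  }

  /** All threads use the same client: racy when threads > 1. */
  static int sharedFailures(int threads, int fetches) {
    FakeFtpClient shared = new FakeFtpClient();
    return run(threads, fetches, () -> shared);
  }

  /** Each thread gets its own client, so no thread can close another's connection. */
  static int perThreadFailures(int threads, int fetches) {
    ThreadLocal<FakeFtpClient> perThread = ThreadLocal.withInitial(FakeFtpClient::new);
    return run(threads, fetches, perThread::get);
  }

  public static void main(String[] args) {
    // 10 threads mirrors fetcher.threads.per.queue=10 from the report.
    System.out.println("shared client failures:     " + sharedFailures(10, 10000));
    System.out.println("per-thread client failures: " + perThreadFailures(10, 10000));
  }
}
```

      With one client per thread the failure count is always 0. With a shared client the count depends on thread scheduling and may vary between runs, which would match errors that disappear when fetcher.threads.per.queue is set back to 1.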

            People

            • Assignee: Unassigned
            • Reporter: Hiran Chaudhuri (hiranchaudhuri)