Description
I tried running Nutch against my Synology NAS. Because Nutch does not support the SMB protocol, I enabled the FTP service on the NAS and configured Nutch to crawl ftp://nas. To increase crawl speed, I also set fetcher.threads.per.queue=10 in nutch-site.xml.
Some files could not be downloaded, and I could not find a useful error message. I therefore changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) so that it not only returns the protocol status but also writes the full exception and stack trace to the logs:
{{ } catch (Exception e) {
  // Log the URL together with the full exception and stack trace
  LOG.warn("Could not get {}", url, e);
  return new ProtocolOutput(null, new ProtocolStatus(e));
}
}}
With this setup, I saw messages like these in the logs:
{{2017-10-25 22:52:54,699 WARN org.apache.nutch.protocol.ftp.Ftp - ftp.client.login() failed: nas/192.168.178.43
2017-10-25 22:52:54,718 WARN org.apache.nutch.protocol.ftp.Ftp - Error:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read(BufferedReader.java:182)
at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
2017-10-25 22:52:54,721 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/Desktop/Segelclub.txt~
org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read(BufferedReader.java:182)
at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
... 2 more
2017-10-25 22:52:54,730 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/hiran/svn/glib-2.2.3/tests/cxx-test.C
org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket closed
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read(BufferedReader.java:182)
at org.apache.commons.net.io.CRLFLineReader.readLine(CRLFLineReader.java:58)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:310)
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:290)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:479)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:552)
at org.apache.commons.net.ftp.FTP.user(FTP.java:698)
at org.apache.nutch.protocol.ftp.Client.login(Client.java:294)
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:190)
... 2 more
2017-10-25 22:52:54,734 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/include/asm-generic/shmparam.h
org.apache.nutch.protocol.ftp.FtpException: java.net.SocketException: Socket is not connected
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:308)
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:132)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
Caused by: java.net.SocketException: Socket is not connected
at java.net.Socket.getInputStream(Socket.java:905)
at org.apache.commons.net.SocketClient.connectAction(SocketClient.java:143)
at org.apache.commons.net.ftp.FTP.connectAction(FTP.java:374)
at org.apache.commons.net.SocketClient.connect(SocketClient.java:172)
at org.apache.commons.net.SocketClient.connect(SocketClient.java:266)
at org.apache.nutch.protocol.ftp.FtpResponse.<init>(FtpResponse.java:175)
... 2 more
2017-10-25 22:52:54,744 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/home/hiran/.compiz/
org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 500
at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
}}
Please note that all of these problems vanished when I set fetcher.threads.per.queue back to 1 (the default value).
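
Since the errors appear only with more than one thread per queue, one plausible explanation (an assumption, not confirmed by the logs above) is that several fetcher threads share one stateful FTP client, so one thread disconnects while another is still logging in or reading. The following is a minimal, self-contained Java sketch of how per-thread isolation avoids that kind of interference. FakeClient, fetchIsolated and runFetches are hypothetical names used only for illustration; they are not part of Nutch or Commons Net.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadClientSketch {

    /** Hypothetical stand-in for a stateful, non-thread-safe FTP client. */
    static class FakeClient {
        private boolean connected;

        void connect() { connected = true; }

        String fetch(String path) {
            // Fails if another user of this instance disconnected in between.
            if (!connected) {
                throw new IllegalStateException("Socket is not connected");
            }
            return "content of " + path;
        }

        void disconnect() { connected = false; }
    }

    // Each thread gets its own client, so no connection state is shared.
    private static final ThreadLocal<FakeClient> CLIENT =
            ThreadLocal.withInitial(FakeClient::new);

    /** Connects, fetches and disconnects using the calling thread's own client. */
    static String fetchIsolated(String path) {
        FakeClient client = CLIENT.get();
        client.connect();
        try {
            return client.fetch(path);
        } finally {
            client.disconnect();
        }
    }

    /** Runs concurrent fetches and returns how many of them failed. */
    static int runFetches(int threads, int fetchesPerThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < fetchesPerThread; i++) {
                        try {
                            fetchIsolated("/file" + i);
                        } catch (IllegalStateException e) {
                            failures.incrementAndGet();
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 10 threads mirrors fetcher.threads.per.queue=10 from the report.
        System.out.println("failures: " + runFetches(10, 1000));
    }
}
```

If the plugin does share a single client between threads, then synchronizing access to it or giving each fetcher thread its own client, as sketched here, would be the two obvious directions for a fix. Whether that matches the actual cause would need to be checked against the Ftp and FtpResponse sources.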