NUTCH-2452

Problem retrieving encoded URLs via FTP?

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: protocol
    • Labels:
      None
    • Environment:

      Ubuntu 16.04.3 LTS
      OpenJDK 1.8.0_131
      nutch 1.14-SNAPSHOT
      Synology RS816

      Description

      I tried running Nutch against my Synology NAS. Since Nutch has no SMB protocol plugin, I enabled the FTP service on the NAS and configured Nutch to crawl ftp://nas.
      The crawl gives inconsistent results, which seem to point to problems within Nutch, although this may need further evaluation.

      Some files could not be downloaded and I could not find a useful error message, so I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) to log the full exception and stack trace in addition to returning the protocol status:

      {{
      } catch (Exception e) {
        LOG.warn("Could not get {}", url, e);
        return new ProtocolOutput(null, new ProtocolStatus(e));
      }
      }}
      With this modification, messages like the following now appear in the log file:
      {{2017-10-25 14:14:37,254 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
      2017-10-25 14:14:37,512 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/
      org.apache.nutch.protocol.ftp.FtpError: Ftp Error: 404
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:151)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      }}

      Please note that I did not configure this URL myself; Nutch discovered it while crawling my NAS. The URL also looks perfectly fine to me. Moreover, Firefox displays the directory successfully when given the same URL and the same credentials. I therefore suspect that the FTP client does not decode the percent-encoded URL before sending the path to the FTP server, so the server cannot match it to a directory.
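
The suspicion above can be checked in isolation. The following is a minimal sketch, not Nutch code: the class name EncodedPathDemo and its helper are hypothetical. It shows that java.net.URL performs no percent-decoding, so a client that passes getPath() straight to the server would request a directory literally named "Kenya%20Pics".

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; not part of Nutch.
final class EncodedPathDemo {

  /** Returns the path exactly as java.net.URL reports it (no percent-decoding). */
  static String rawPath(String url) {
    try {
      return new URL(url).getPath();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String url = "ftp://nas/silver-sda2/home/vivi/Desktop/Pictures/Kenya%20Pics/";
    // The escape sequence %20 is still present in the printed path.
    System.out.println(rawPath(url));
  }
}
```

An FTP server has no notion of URL escaping, so it would look for that literal name, which would be consistent with the 404 in the log above.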

        Activity

        hiranchaudhuri Hiran Chaudhuri added a comment -

        I pulled the change and it looks good from my side. Thank you.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Nutch-trunk #3465 (See https://builds.apache.org/job/Nutch-trunk/3465/)
        NUTCH-2452 Allow nutch to retrieve Ftp URLs that contain UrlEncoded (snagel: https://github.com/apache/nutch/commit/517dbdf3261d42e90883d07320b7991ff8e2bcf8)

        • (edit) src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java
        wastl-nagel Sebastian Nagel added a comment -

        Picked 61e0ae7 from pull-request #237. Thanks, Hiran Chaudhuri!

        wastl-nagel Sebastian Nagel added a comment -

        Thanks, this should be fixed.

        hiranchaudhuri Hiran Chaudhuri added a comment -

        It seems I can fix the problem by adding this line to the constructor org.apache.nutch.protocol.ftp.FtpResponse(URL, CrawlDatum, Ftp, Configuration):

        path = java.net.URLDecoder.decode(path, "UTF-8");
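
As a standalone illustration of what that line does, here is a minimal sketch. The class FtpPathDecodeDemo and its helper names are hypothetical and not taken from the Nutch code base. It also shows one property of java.net.URLDecoder worth keeping in mind: it implements form decoding, so a literal '+' in a file name becomes a space, whereas java.net.URI.getPath() decodes percent-escapes only. Whether that difference matters for a given crawl is an assumption to verify, not something this issue establishes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Nutch.
final class FtpPathDecodeDemo {

  /** Decodes a path the same way as the one-line fix quoted above. */
  static String decodeLikeFix(String path) {
    try {
      return URLDecoder.decode(path, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a conforming JVM.
      throw new IllegalStateException(e);
    }
  }

  /** Alternative: let java.net.URI decode the percent-escapes of the path. */
  static String decodeWithUri(String url) {
    try {
      return new URI(url).getPath();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(decodeLikeFix("/Pictures/Kenya%20Pics/")); // /Pictures/Kenya Pics/
    System.out.println(decodeLikeFix("/docs/a+b.txt"));           // /docs/a b.txt
    System.out.println(decodeWithUri("ftp://nas/docs/a+b.txt"));  // /docs/a+b.txt
  }
}
```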


          People

          • Assignee:
            Unassigned
            Reporter:
            hiranchaudhuri Hiran Chaudhuri
          • Votes:
            0
            Watchers:
            3
