  1. Nutch
  2. NUTCH-2451

MalformedURLException on perfectly fine-looking URLs?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: protocol
    • Labels:
      None
    • Environment:

      Ubuntu 16.04.3 LTS
      OpenJDK 1.8.0_131
      nutch 1.14-SNAPSHOT
      Synology RS816

      Description

      I tried running Nutch on my Synology NAS. As the SMB protocol is not supported by Nutch, I turned on the FTP service on the NAS and configured Nutch to crawl ftp://nas.
      The results vary in ways that seem to point to problems within Nutch, although this may need further evaluation.

      As some files could not be downloaded and I could not see a useful error message, I changed the method org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Text, CrawlDatum) so that it not only returns the protocol status but also sends the full exception and stack trace to the logs:

      {{
      } catch (Exception e) {
        // log the URL together with the full stack trace
        LOG.warn("Could not get {}", url, e);
        return new ProtocolOutput(null, new ProtocolStatus(e));
      }
      }}
      With this modification I now see messages like these in the log file:
      {{2017-10-25 22:09:31,865 TRACE org.apache.nutch.protocol.ftp.Ftp - fetching ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
      2017-10-25 22:09:32,147 WARN org.apache.nutch.protocol.ftp.Ftp - Could not get ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so
      java.net.MalformedURLException
      at java.net.URL.<init>(URL.java:627)
      at java.net.URL.<init>(URL.java:490)
      at java.net.URL.<init>(URL.java:439)
      at org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145)
      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:340)
      Caused by: java.lang.NullPointerException
      }}

      Please note that the URL was not configured by me. Instead it was discovered by crawling my NAS. The URL also looks perfectly fine to me. Even if the file did not exist, I would not expect a MalformedURLException to occur. What is more, using Firefox with the same authentication data on the same URL retrieves the file successfully.

      How come Nutch cannot get the file?

        Activity

        hiran_chaudhuri Hiran Chaudhuri added a comment -

        Your suggested fix runs well for me. I created a pull request.

        wastl-nagel Sebastian Nagel added a comment - - edited

        Ok, after a look at the code (Ftp.java): it happens during redirect handling. I didn't check the FTP spec, but in HTTP, redirects may be absolute or relative. For the latter case it should be: u = new URL(u, response.getHeader("Location")); (within a try block to catch and log the exception together with the URL and the redirect location).
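The suggestion above can be sketched in isolation as follows. This is only an illustration, not the code in Ftp.java: the class name RedirectResolver, the method resolve and the use of System.err instead of the Nutch logger are assumptions made for the example. It resolves the redirect target against the URL that was fetched and reports both values when resolution fails.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper, not part of Nutch. */
final class RedirectResolver {

  /**
   * Resolves a redirect target against the URL that was fetched.
   * Returns null if the target is missing or cannot be parsed.
   */
  static URL resolve(URL base, String location) {
    if (location == null) {
      // A missing header value cannot be turned into a URL at all.
      System.err.println("Redirect from " + base + " carries no location");
      return null;
    }
    try {
      // The two-argument constructor accepts absolute and relative targets.
      return new URL(base, location);
    } catch (MalformedURLException e) {
      // Report both the fetched URL and the redirect target.
      System.err.println("Bad redirect from " + base + " to " + location + ": " + e);
      return null;
    }
  }

  public static void main(String[] args) throws MalformedURLException {
    URL base = new URL("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so");
    System.out.println(resolve(base, "ARMSCII-9.so"));          // relative target
    System.out.println(resolve(base, "ftp://nas/other/file"));  // absolute target
    System.out.println(resolve(base, null));                    // missing target
  }
}
```

Returning null rather than rethrowing is a choice made for the sketch; a real fix would more likely map the failure to a ProtocolStatus.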

        hiranchaudhuri Hiran Chaudhuri added a comment - - edited

        Let's assume no suitable URLStreamHandler is registered. The PluginRepository, as it carries my proposed changes from NUTCH-2429, is registered as the URLStreamHandlerFactory, so it definitely should be involved when the ftp:// URL is constructed. Either it finds a suitable URLStreamHandler provided by a plugin, or it falls back to the JVM defaults, which definitely can handle ftp:// URLs. That a suitable URLStreamHandler is found, either by the URLStreamHandlerFactory or by the JVM, is evident: I only provided the ftp://nas URL, and Nutch crawled successfully until it found the offending URL ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so. That would not have worked if FTP support were missing completely.
        Therefore I believe the assumption is wrong. A suitable URLStreamHandler is available at runtime.
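To illustrate the fallback described above, here is a minimal sketch of a URLStreamHandlerFactory that only knows the handlers registered with it. It is not the PluginRepository from NUTCH-2429; the class name PluginHandlerFactory and its register method are hypothetical. The point is the contract of java.net.URL: when the factory returns null for a protocol, the JVM continues with its built-in handler lookup.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical factory for illustration; not Nutch's PluginRepository. */
final class PluginHandlerFactory implements URLStreamHandlerFactory {

  // protocol name -> handler contributed by a (hypothetical) plugin
  private final Map<String, URLStreamHandler> pluginHandlers = new HashMap<>();

  void register(String protocol, URLStreamHandler handler) {
    pluginHandlers.put(protocol, handler);
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // Returning null tells java.net.URL to use its built-in lookup instead.
    return pluginHandlers.get(protocol);
  }

  public static void main(String[] args) {
    PluginHandlerFactory factory = new PluginHandlerFactory();
    // No plugin registered a handler for ftp, so the factory declines.
    System.out.println(factory.createURLStreamHandler("ftp"));
  }
}
```

Such a factory would be installed with URL.setURLStreamHandlerFactory(factory), which may be called at most once per JVM; the sketch deliberately does not do that so it can be run repeatedly.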

        Upon further analysis I find that the stack trace points to org.apache.nutch.protocol.ftp.Ftp.getProtocolOutput(Ftp.java:145), which boils down to
        u = new URL(response.getHeader("Location"));
        This means the URL being constructed is not the FTP URL we see in the log output but the value of a header, which may not have been set by the protocol-ftp plugin.
        Therefore I do not agree that NUTCH-2429 is related to, let alone the cause of, this problem.
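The "Caused by: java.lang.NullPointerException" line in the reported trace is consistent with the header value being null. The following sketch tries to reproduce that outside of Nutch; the class name NullSpecDemo and its method are made up for the example, and the exact behaviour may differ between JDK versions.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical demo class, not part of Nutch. */
final class NullSpecDemo {

  /**
   * Tries to parse the given spec. Returns the cause attached to the
   * MalformedURLException, or null if parsing succeeds or no cause is set.
   */
  static Throwable causeOfFailure(String spec) {
    try {
      new URL(spec);
      return null;
    } catch (MalformedURLException e) {
      return e.getCause();
    }
  }

  public static void main(String[] args) {
    String location = null; // stands in for a header that was never set
    System.out.println(causeOfFailure(location));
    // The URL from the log, for comparison
    System.out.println(causeOfFailure("ftp://nas/MediaPC/usr/lib32/gconv/ARMSCII-8.so"));
  }
}
```

If the first call reports a NullPointerException and the second reports nothing, the URL shown in the log is not the one that failed to parse.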

        wastl-nagel Sebastian Nagel added a comment -

        This problem resembles those discussed in NUTCH-2429: for some reason (maybe a race condition or a class path issue) there is no ftp URLStreamHandler registered at this point. There must have been one if crawling over ftp succeeded so far (pages fetched, new ftp:// URLs found).


          People

          • Assignee: Unassigned
          • Reporter: hiranchaudhuri Hiran Chaudhuri
          • Votes: 0
          • Watchers: 3
