Nutch / NUTCH-1483

Can't crawl filesystem with protocol-file plugin

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6, 2.1
    • Fix Version/s: 2.3, 1.10
    • Component/s: protocol
    • Labels:
      None
    • Environment:

      OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4

    • Patch Info:
      Patch Available

      Description

      I tried to follow the same steps described in this wiki page:

      http://wiki.apache.org/nutch/IntranetDocumentSearch

      I made all the required changes to regex-urlfilter.txt and added the following entry to my seed file:

      file:///home/rogerio/Documents/

      The permissions are fine: I'm running Nutch as the user who owns the folder, so Nutch has all the required permissions. Unfortunately, I'm getting the following error:

      org.apache.nutch.protocol.file.FileError: File Error: 404
      at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
      at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
      fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

      Why do the logs show file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/?

      Note: The regex-urlfilter entry only works as expected if I add
      +file://home/rogerio/Documents/ instead of +file:///home/rogerio/Documents/, which is what the wiki says to use.
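
For illustration, here is a minimal sketch of why the two spellings are not interchangeable. It uses only the standard java.net.URI class; the class name FileUrlParts and its helper methods are hypothetical and are not part of Nutch. It does not reproduce Nutch's own URL handling. It only shows that in the two-slash form "home" is parsed as the host, so the path no longer starts at /home, which would be consistent with the 404 in the log.

```java
import java.net.URI;

// Illustrative helper (hypothetical, not part of Nutch): shows how
// java.net.URI splits the two spellings of the seed URL.
class FileUrlParts {

    // Returns the host component, or "" when the URL has no authority.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host;
    }

    // Returns the path component of the URL.
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:///home/rogerio/Documents/", // three slashes: empty authority
            "file://home/rogerio/Documents/"   // two slashes: "home" is parsed as the host
        };
        for (String url : urls) {
            System.out.println(url + " -> host='" + hostOf(url)
                    + "', path='" + pathOf(url) + "'");
        }
    }
}
```

With the three-slash form the path is the full /home/rogerio/Documents/, while with the two-slash form the path is only /rogerio/Documents/ and "home" ends up in the host component.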

        Attachments

        1. TestProtocolFileUrlUri.java
          0.8 kB
          Sebastian Nagel

              People

              • Assignee: Unassigned
              • Reporter: Rogério Pereira Araújo (ararog)