Nutch / NUTCH-1483

Can't crawl filesystem with protocol-file plugin


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6, 2.1
    • Fix Version/s: 2.3, 1.10
    • Component/s: protocol
    • Labels: None
    • Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4

    • Patch Available

    Description

      I tried to follow the same steps described in this wiki page:

      http://wiki.apache.org/nutch/IntranetDocumentSearch

      I made all the required changes to regex-urlfilter.txt and added the following entry to my seed file:

      file:///home/rogerio/Documents/

      The permissions are OK: I'm running Nutch as the same user that owns the folder, so Nutch has all the required permissions. Unfortunately, I'm getting the following error:

      org.apache.nutch.protocol.file.FileError: File Error: 404
      at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
      at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
      fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

      Why are the logs showing file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/?
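
      The two spellings are not interchangeable, which may help explain the 404. The sketch below uses only java.net.URI from the JDK (it is not Nutch code, and the class name FileUrlForms is made up for illustration) to show how each form is split into host and path. Whether protocol-file resolves the logged URL the same way is an assumption, not something the report confirms.

```java
import java.net.URI;

public class FileUrlForms {
    /** Host component of the URL, or null when the URL carries no authority. */
    public static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Path component of the URL as parsed by java.net.URI. */
    public static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] forms = {
            "file:///home/rogerio/Documents/", // form used in the seed file
            "file://home/rogerio/Documents/"   // form that shows up in the log
        };
        for (String form : forms) {
            // Print how the JDK splits each spelling.
            System.out.println(form + " -> host=" + hostOf(form) + ", path=" + pathOf(form));
        }
    }
}
```

      With two slashes, "home" is parsed as a host name and the path starts at /rogerio/Documents/, so a lookup based on that path would miss the real directory.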

      Note: The regex-urlfilter entry only works as expected if I add
      +file://home/rogerio/Documents/ instead of the +file:///home/rogerio/Documents/ form given in the wiki.
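
      The note is consistent with the filter seeing the two-slash form of the URL. A minimal sketch with plain java.util.regex follows. It assumes the text after the leading + is treated as a regular expression searched for inside the URL; the class FilterRuleCheck and the method accepts are hypothetical names, not Nutch's filter implementation.

```java
import java.util.regex.Pattern;

public class FilterRuleCheck {
    /**
     * True when the rule's regex is found somewhere in the URL.
     * The find() matching mode is an assumption made for this sketch.
     */
    public static boolean accepts(String ruleRegex, String url) {
        return Pattern.compile(ruleRegex).matcher(url).find();
    }

    public static void main(String[] args) {
        // URL as it appears in the fetcher log of this report.
        String logged = "file://home/rogerio/Documents/";

        // Rule text with the leading '+' (accept marker) removed.
        System.out.println(accepts("file:///home/rogerio/Documents/", logged));
        System.out.println(accepts("file://home/rogerio/Documents/", logged));
    }
}
```

      Under that assumption, the three-slash rule cannot match a URL that has already lost one slash, while the two-slash rule does.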

      Attachments

        1. TestProtocolFileUrlUri.java
          0.8 kB
          Sebastian Nagel

            People

              Assignee: Unassigned
              Reporter: Rogério Pereira Araújo (ararog)