Nutch / NUTCH-1483

Can't crawl filesystem with protocol-file plugin

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.6, 2.1
    • Fix Version/s: 2.3, 1.10
    • Component/s: protocol
    • Labels: None
    • Environment: OpenSUSE 12.1, OpenJDK 1.6.0, HBase 0.90.4
    • Patch Info: Patch Available

    Description

      I followed the steps described on this wiki page:

      http://wiki.apache.org/nutch/IntranetDocumentSearch

      I made all the required changes to regex-urlfilter.txt and added the following entry to my seed file:

      file:///home/rogerio/Documents/

      The permissions are fine: I'm running Nutch as the user who owns the folder, so Nutch has all the access it needs. Unfortunately, I'm getting the following error:

      org.apache.nutch.protocol.file.FileError: File Error: 404
      at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
      at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
      fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

      Why do the logs show file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/?
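
The difference between the two spellings matters because of how a file URL is split into host and path. The following is a minimal sketch, assuming standard java.net.URI parsing; the class FileUrlCheck is hypothetical and not part of Nutch, and the sketch does not claim to show where Nutch drops the slash.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: shows how java.net.URI
// splits a file URL into host and path.
class FileUrlCheck {
    // Returns "host=<host> path=<path>" for the given URL string.
    static String describe(String url) {
        URI uri = URI.create(url);
        return "host=" + uri.getHost() + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Three slashes: empty authority, the full directory is the path.
        System.out.println(describe("file:///home/rogerio/Documents/"));
        // prints "host=null path=/home/rogerio/Documents/"

        // Two slashes: "home" is read as the host, and the path loses its first segment.
        System.out.println(describe("file://home/rogerio/Documents/"));
        // prints "host=home path=/rogerio/Documents/"
    }
}
```

With only two slashes, the path that would be looked up is /rogerio/Documents/, which would be consistent with the 404 reported above if that directory does not exist.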

      Note: The regex-urlfilter entry only works as expected if I add
      +file://home/rogerio/Documents/ instead of +file:///home/rogerio/Documents/ as the wiki says.
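
The filter behaviour in the note can be illustrated with plain java.util.regex. This is a sketch under an assumption: that an accept rule passes a URL when the rule's regex is found anywhere in it. The class UrlFilterCheck and its accepts method are hypothetical and are not Nutch's filter implementation.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for an accept rule, not Nutch's actual filter code.
class UrlFilterCheck {
    // Assumption: a rule accepts a URL when its regex is found anywhere in the URL.
    static boolean accepts(String ruleRegex, String url) {
        return Pattern.compile(ruleRegex).matcher(url).find();
    }

    public static void main(String[] args) {
        String twoSlashUrl = "file://home/rogerio/Documents/";

        // The rule from the wiki (three slashes) does not match the two-slash URL.
        System.out.println(accepts("file:///home/rogerio/Documents/", twoSlashUrl));
        // prints "false"

        // The two-slash rule matches it.
        System.out.println(accepts("file://home/rogerio/Documents/", twoSlashUrl));
        // prints "true"
    }
}
```

If the filter works this way, the observation in the note suggests that the URL already has only two slashes by the time the filter sees it. That is an inference from the report, not something the report confirms.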

          People

            Assignee: Unassigned
            Reporter: Rogério Pereira Araújo (ararog)
            Votes: 0
            Watchers: 7
