Description
I tried to follow the same steps described in this wiki page:
http://wiki.apache.org/nutch/IntranetDocumentSearch
I made all the required changes to regex-urlfilter.txt and added the following entry to my seed file:
file:///home/rogerio/Documents/
The permissions are fine: I'm running Nutch as the user that owns the folder, so Nutch has all the required permissions. Unfortunately, I'm getting the following error:
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105)
at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514)
fetch of file://home/rogerio/Documents/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
Why do the logs show file://home/rogerio/Documents/ instead of file:///home/rogerio/Documents/?
Note: The regex-urlfilter rule only works as expected if I add the entry
+file://home/rogerio/Documents/ instead of +file:///home/rogerio/Documents/, which is the form the wiki gives.
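For reference, here is a small sketch of how generic URI parsing treats the two forms. It uses only java.net.URI from the JDK, not Nutch code, and the class name FileUrlSlashes is made up for illustration. It does not show where Nutch drops the third slash. It only suggests why the two-slash form in the log may not point at the same location: with two slashes, "home" is parsed as the host and the path starts at /rogerio.

```java
import java.net.URI;

public class FileUrlSlashes {
    // Hypothetical helper: reports the host and path that java.net.URI
    // extracts from a URL string.
    static String describe(String url) {
        URI uri = URI.create(url);
        return "host=" + uri.getHost() + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Three slashes: empty authority, the full directory is the path.
        System.out.println(describe("file:///home/rogerio/Documents/"));
        // Two slashes: "home" becomes the host and is no longer part of the path.
        System.out.println(describe("file://home/rogerio/Documents/"));
    }
}
```

If the protocol plugin resolves only the path part, the second form would look up /rogerio/Documents/, which would be consistent with the 404 above. That is an assumption on my side, not something verified against the Nutch source.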
Issue Links
- duplicates: NUTCH-1076 Solrindex has no documents following bin/nutch solrindex when using protocol-file (Closed)
- is duplicated by: NUTCH-1076 Solrindex has no documents following bin/nutch solrindex when using protocol-file (Closed)
1. urlnormalizer-regex to keep third slash in file:///path/index.html | Closed | Unassigned
2. Regex URL normalizer should remove multiple slashes after file: protocol | Closed | Unassigned
3. URLUtil should not add additional slashes for file URLs | Closed | Unassigned
4. Protocol-file should treat symbolic links as redirects | Closed | Unassigned