I am performing a local file system crawl.
My problem is the following: files whose names contain percent-encoded (hexadecimal) characters, such as %28, do not get crawled.
For example, I see the following error:
fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(
at org.apache.nutch.fetcher.Fetcher$
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
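One possible explanation, sketched below, is that the file on disk is literally named with the characters `%28` and `%29`, while the fetcher decodes the URL before opening the file and therefore looks for a name containing `(` and `)`. This is only an assumption: I have not confirmed that the file protocol plugin in nutch-1.0 decodes paths this way. The class and method names (`PercentEncodedNameCheck`, `decodeName`, `checkLookup`) are hypothetical and not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Nutch.
class PercentEncodedNameCheck {

    // Decode percent-escapes in a file name, e.g. "%28" becomes "(".
    static String decodeName(String name) {
        try {
            return URLDecoder.decode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Creates a file with the literal (undecoded) name in a temp directory and
    // reports whether the literal name and the decoded name exist on disk.
    static boolean[] checkLookup(String fileName) {
        try {
            Path dir = Files.createTempDirectory("nutch-demo");
            Files.createFile(dir.resolve(fileName));
            boolean literalExists = Files.exists(dir.resolve(fileName));
            boolean decodedExists = Files.exists(dir.resolve(decodeName(fileName)));
            return new boolean[] { literalExists, decodedExists };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "A.M._%28album%29_8a09.html";
        boolean[] result = checkLookup(name);
        System.out.println("decoded name:   " + decodeName(name));
        System.out.println("literal exists: " + result[0]);
        System.out.println("decoded exists: " + result[1]);
    }
}
```

If the decoded lookup fails while the literal one succeeds, a 404 for such files would be consistent with the path being decoded somewhere between the URL and the file system access.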
I am using nutch-1.0.
Among other standard settings, I configured nutch-site.xml as follows:
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
Moreover, crawl-urlfilter.txt looks like:
# skip http:, ftp:, & mailto: urls
# skip image and other suffixes we can't yet parse
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# accept hosts in MY.DOMAIN.NAME
# accept everything else
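To check whether the URL filter could be what rejects these files, here is a minimal sketch of how such a rule file is evaluated, as I understand it: each rule is a regular expression prefixed with `+` (accept) or `-` (reject), the first rule that matches decides, and a URL that matches no rule is rejected. Only the comment lines of my filter file are shown above, so the patterns in `EXAMPLE_RULES` are illustrative assumptions, not my actual rules. The class `UrlFilterSketch` is hypothetical and is not Nutch's own filter implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex URL filter, not Nutch code.
class UrlFilterSketch {

    // One filter rule: '+' accepts, '-' rejects.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Assumed patterns matching the comments in the post; the real file may differ.
    static final String[] EXAMPLE_RULES = {
        "# skip http:, ftp:, & mailto: urls",
        "-^(http|ftp|mailto):",
        "# skip URLs containing certain characters as probable queries, etc.",
        "-[?*!@=]",
        "# accept everything else",
        "+."
    };

    // Parse rule lines, skipping blank lines and '#' comments.
    static List<Rule> parse(String[] lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
        return rules;
    }

    // First matching rule wins; a URL matching no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(EXAMPLE_RULES);
        String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        System.out.println("accepted: " + accepts(rules, url));
    }
}
```

Under these assumed patterns the `%` character is not in the rejected character class, so the failing URL would pass the filter. That fits the log above, where the fetch is attempted and then fails with a 404 rather than being skipped.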