[NUTCH-824] Crawling - File Error 404 when fetching file with an hexadecimal character in the file name. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0, 1.2, 1.3, nutchgora
Fix Version/s: 1.0.0, 1.3, nutchgora
Component/s: fetcher
Labels: None
Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux

Description

Hello,

I am crawling a local file system.
My problem is the following: files whose names contain percent-encoded (hexadecimal) characters, such as %28 and %29, do not get crawled.

For example, I see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0.
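
One plausible reading of the 404, offered as an assumption rather than a confirmed diagnosis, is that the escapes in the URL (%28, %29) are used literally when the file is looked up, while the file on disk may be named with literal parentheses. A minimal Java sketch of the difference between the raw and the decoded path of such a URL follows; the class and method names are hypothetical and are not part of Nutch.

```java
import java.net.URI;

public class FileUrlPathSketch {
    // Path exactly as written in the URL, with percent-escapes left in place.
    static String rawPath(String fileUrl) {
        return URI.create(fileUrl).getRawPath();
    }

    // Path with percent-escapes decoded, e.g. %28 becomes '('.
    static String decodedPath(String fileUrl) {
        return URI.create(fileUrl).getPath();
    }

    public static void main(String[] args) {
        String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        // Ends in A.M._%28album%29_8a09.html
        System.out.println(rawPath(url));
        // Ends in A.M._(album)_8a09.html
        System.out.println(decodedPath(url));
    }
}
```

If only the decoded name exists on disk, a lookup with the raw path would fail, which would be consistent with the error above.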

Among other standard settings, I configured nutch-site.xml as follows:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*
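
To check whether these filter rules could be what drops the URL, the sketch below applies them in order in Java. It assumes the first matching rule decides (`+` accepts, `-` rejects) and that a URL matching no rule is rejected, which is how the regex URL filter is generally understood to behave. The class name is hypothetical, and the loop rule is written with `[^/]+` quantifiers, which is an assumption about the intended pattern.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Active rules from the filter file, in order; comment lines are omitted.
    static final String[] RULES = {
        "-^(http|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
        "-[?*!@=]",
        "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
        "+.*"
    };

    // Returns true if the first rule that matches the URL is an accept rule.
    static boolean accepts(String url) {
        for (String line : RULES) {
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return line.charAt(0) == '+';
            }
        }
        return false; // assumed default when no rule matches
    }

    public static void main(String[] args) {
        String url = "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";
        System.out.println(accepts(url)); // prints "true"
    }
}
```

Under these assumptions the failing URL passes the filter, so the rejection would have to happen later, at fetch time.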

Thanks,

Michela

People

Assignee: Julien Nioche

Reporter: Michela Becchi

Votes: 0

Watchers: 2

Dates

Created: 20/May/10 20:26

Updated: 25/Jun/11 12:53

Resolved: 07/Jan/11 17:18