NUTCH-824

Crawling - File Error 404 when fetching a file with a percent-encoded hexadecimal escape in the file name.


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 1.2, 1.3, nutchgora
    • Fix Version/s: 1.0.0, 1.3, nutchgora
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux

Description

      Hello,

I am crawling a local file system.
My problem is the following: no file whose name contains a percent-encoded hexadecimal escape (such as %28) gets crawled.

For example, I see the following error:

      fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
      org.apache.nutch.protocol.file.FileError: File Error: 404
      at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
      at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
      fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

      I am using nutch-1.0.
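
Digging into it, the failure looks consistent with the protocol-file plugin resolving the raw, still percent-encoded URL path directly against the local file system. A minimal standalone sketch of that behavior (a hypothetical demo class, not Nutch's own code; the path is the one from the log above):

import java.io.File;
import java.net.URL;
import java.net.URLDecoder;

public class PercentEncodedPathDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html");

        // Using the raw path looks for a file literally named "A.M._%28album%29_8a09.html",
        // which does not exist on disk, hence the 404-style FileError.
        File raw = new File(url.getPath());
        System.out.println(raw.getPath() + " exists? " + raw.exists());

        // Decoding the escapes first yields the real name "A.M._(album)_8a09.html".
        // (Caveat: URLDecoder also turns '+' into a space, so a real fix must
        // decode only the %XX escapes.)
        File decoded = new File(URLDecoder.decode(url.getPath(), "UTF-8"));
        System.out.println(decoded.getPath() + " exists? " + decoded.exists());
    }
}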

Among other standard settings, I configured nutch-site.xml as follows:

      <property>
      <name>plugin.includes</name>
      <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
      </property>

      <property>
      <name>file.content.limit</name>
      <value>-1</value>
      </property>
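
For context, file.content.limit caps how many bytes of each file are downloaded; -1 removes the cap so large dump files are fetched whole. A minimal sketch of how such a setting is read via Hadoop's Configuration API, which Nutch builds on (the demo class and the 64 KB fallback are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;

public class ContentLimitDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Nutch layers its settings on top of Hadoop's Configuration;
        // nutch-site.xml overrides nutch-default.xml.
        conf.addResource("nutch-default.xml");
        conf.addResource("nutch-site.xml");

        // -1 disables truncation: the protocol plugin reads the entire file.
        int limit = conf.getInt("file.content.limit", 64 * 1024);
        System.out.println("file.content.limit = " + limit);
    }
}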

      Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# accept everything else
+.*
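
Note that none of these patterns rejects a percent sign, so the failing URL passes the filter and does reach the fetcher. A quick standalone check of the loop-breaking rule (a hypothetical demo class, not Nutch's own filter code; backslashes are doubled for a Java string literal):

import java.util.regex.Pattern;

public class LoopFilterCheck {
    public static void main(String[] args) {
        // The loop-breaking rule from crawl-urlfilter.txt above.
        Pattern loop = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        // A segment repeated three times triggers the rule...
        System.out.println(loop.matcher("http://host/a/b/a/c/a/d").find());  // true

        // ...but the failing file URL does not, so it is not filtered out here.
        System.out.println(loop.matcher(
                "file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html")
                .find());  // false
    }
}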

      Thanks,

      Michela

People

• Assignee: jnioche Julien Nioche
• Reporter: mbecchi Michela Becchi
• Votes: 0
• Watchers: 2
