NUTCH-275

Fetcher not parsing XHTML-pages at all

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None
    • Environment: problem with nightly-2006-05-20; worked fine with the same website on 0.7.2

    Description

      The server reports the page as "text/html", so I expected it to be processed as HTML.
      But something, I guess, evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?).

      For some reason no plugin can be found for indexing text/xml (why does TextParser not feel responsible?).

      Links inside this document are NOT indexed at all, so crawling of this website effectively stops here.
      Funny thing: for some reason the DTD files referenced in the document header seem to be valid links for the fetcher, and as such are indexed in the next round (if the urlfilter allows them).

      060521 025018 fetching http://www.secreturl.something/
      060521 025018 http.proxy.host = null
      060521 025018 http.proxy.port = 8080
      060521 025018 http.timeout = 10000
      060521 025018 http.content.limit = 65536
      060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
      060521 025018 fetcher.server.delay = 1000
      060521 025018 http.max.delays = 1000
      060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
      060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
      060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
      060521 025019 map 0% reduce 0%
      060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
      060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
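
Read together, the two ParserFactory warnings above suggest why the page ends up without a parser: a plugin mapped to a content type seems to be usable only if it is also enabled via plugin.includes and its own plugin.xml claims that type. The following is a minimal sketch of that selection rule, not Nutch's actual ParserFactory code. The class name, the method, the mapping table, the claimed-type table, and the plugin.includes value are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class ParserSelectionSketch {

    // Assumed stand-in for parse-plugins.xml: content type -> candidate plugin ids.
    static final Map<String, List<String>> MAPPING = Map.of(
        "text/html", List.of("parse-html"),
        "text/xml", List.of("parse-text", "parse-rss"));

    // Assumed stand-in for what each plugin.xml claims to support.
    static final Map<String, Set<String>> CLAIMED = Map.of(
        "parse-html", Set.of("text/html"),
        "parse-text", Set.of("text/plain"),
        "parse-rss", Set.of("text/xml"));

    // Assumed plugin.includes value (a regular expression over plugin ids).
    static final String INCLUDES = "parse-(text|html)";

    // Returns the mapped plugins that are both enabled and claim the content type.
    static List<String> usableParsers(String contentType,
                                      Map<String, List<String>> mapping,
                                      Map<String, Set<String>> claimed,
                                      String includesRegex) {
        List<String> usable = new ArrayList<>();
        for (String id : mapping.getOrDefault(contentType, List.of())) {
            boolean enabled = Pattern.matches(includesRegex, id);
            boolean claims = claimed.getOrDefault(id, Set.of()).contains(contentType);
            if (enabled && claims) {
                usable.add(id);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // parse-text does not claim text/xml and parse-rss is not enabled.
        System.out.println(usableParsers("text/xml", MAPPING, CLAIMED, INCLUDES));  // prints []
        System.out.println(usableParsers("text/html", MAPPING, CLAIMED, INCLUDES)); // prints [parse-html]
    }
}
```

Under these assumed settings, a page re-labeled as text/xml gets no usable parser, so no outlinks would be extracted from it, while the same page kept as text/html would still reach the HTML parser.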

          People

            Assignee: Unassigned
            Reporter: Stefan Neufeind (neufeind)