  1. Nutch
  2. NUTCH-274

Empty row in/at end of URL-list results in error

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8.2, 0.9.0
    • Component/s: None
    • Labels: None
    • Environment: nightly-2006-05-20

    Description

      This is minor, but it's a little unclean.

      Reproduce: Have a URL-file with one URL followed by a newline, thus producing an empty line.

      Outcome: The fetcher threads try to fetch two URLs at the same time. The first one is fine, but the second is empty and therefore fails protocol detection.

      060521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
      060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
      060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
      060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
      060521 022639 fetching http://www.bild.de/
      060521 022639 fetching
      060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
      060521 022639 http.proxy.host = null
      060521 022639 http.proxy.port = 8080
      060521 022639 http.timeout = 10000
      060521 022639 http.content.limit = 65536
      060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
      060521 022639 fetcher.server.delay = 1000
      060521 022639 http.max.delays = 1000
      060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but
      its plugin.xml file does not claim to support contentType: text/xml
      060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but
      its plugin.xml file does not claim to support contentType: text/xml
      060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but
      not enabled via plugin.includes in nutch-default.xml
      060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
      060521 022640 map 0% reduce 0%
      060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
      060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
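
      The log above shows the empty line reaching the fetcher as a URL. A minimal sketch of the kind of guard that avoids this, assuming the seed list is available as a list of lines: trim each line and skip it if nothing remains. The class and method names (SeedLines, nonBlank) are hypothetical and are not taken from Nutch or from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
class SeedLines {

    /** Returns the trimmed, non-empty lines of a seed list, in their original order. */
    static List<String> nonBlank(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // blank or whitespace-only line: nothing to inject
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        // One URL followed by an empty line, as in the reproduction steps.
        List<String> seed = Arrays.asList("http://www.bild.de/", "");
        System.out.println(nonBlank(seed)); // prints [http://www.bild.de/]
    }
}
```

      With such a filter in place, only http://www.bild.de/ would be passed on, so no fetch of an empty URL would be attempted.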

      Attachments

        1. ignoreEmpthyLineDuringInjectV1.patch
          0.7 kB
          Stefan Groschupf

          People

            Assignee: Andrzej Bialecki
            Reporter: Stefan Neufeind