
Key: NUTCH-274
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Andrzej Bialecki
Reporter: Stefan Neufeind
Votes: 0
Watchers: 1
Nutch

Empty row in/at end of URL-list results in error

Created: 21/May/06 07:40 AM   Updated: 28/Dec/06 12:20 AM
Component/s: None
Affects Version/s: 0.8
Fix Version/s: 0.8.2, 0.9.0

Time Tracking:
Not Specified

File Attachments:
ignoreEmpthyLineDuringInjectV1.patch (text file, 0.7 kB, licensed for inclusion in ASF works), 2006-06-02 11:25 PM, Stefan Groschupf
Environment: nightly-2006-05-20

Resolution Date: 28/Dec/06 12:20 AM


Description
This is minor, but it is a little unclean.

Reproduce: Use a URL file containing one URL followed by a newline, which produces an empty line at the end of the file.

Outcome: The fetcher threads try to fetch two URLs at the same time. The first one is fine, but the second is empty and therefore fails protocol detection.

060521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 10000
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640 map 0% reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
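The attached patch is not reproduced here. As a minimal sketch of the general idea its name suggests (ignoring empty lines when reading the URL list), the following standalone Java example filters blank lines before any URL is handed on. The class `SeedLines` and the method `nonEmptyUrls` are hypothetical names used for illustration only; they are not part of Nutch and do not show how the Injector was actually changed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not taken from the Nutch code base.
public class SeedLines {

    /** Returns the trimmed, non-empty lines of a URL list, in their original order. */
    static List<String> nonEmptyUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip the empty row instead of passing "" on as a URL
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        // One URL followed by an empty line, as in the reproduction steps.
        List<String> urlFile = Arrays.asList("http://www.bild.de/", "");
        for (String url : nonEmptyUrls(urlFile)) {
            System.out.println("would inject: " + url);
        }
    }
}
```

With a filter like this in place, the empty entry never reaches the fetcher, so the "no protocol" failure shown in the log above would not be triggered by a trailing empty line.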



Stefan Groschupf made changes - 02/Jun/06 11:25 PM
Field | Original Value | New Value
Attachment | (empty) | ignoreEmpthyLineDuringInjectV1.patch [ 12334954 ]

Andrzej Bialecki made changes - 28/Dec/06 12:20 AM
Field | Original Value | New Value
Status | Open [ 1 ] | Closed [ 6 ]
Fix Version/s | (empty) | 0.9.0 [ 12312013 ]
Fix Version/s | (empty) | 0.8.2 [ 12312064 ]
Resolution | (empty) | Fixed [ 1 ]
Assignee | (empty) | Andrzej Bialecki [ ab ]