Nutch / NUTCH-1418

error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.7, 2.2
    • Component/s: None
    • Labels: None

    Description

      Since learning that Nutch will be unable to crawl the JavaScript function calls in href attributes, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
      I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized that Nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following three lines, which I believe indicate that Nutch is unable to parse the site's robots.txt and that the fetcher therefore stopped:

      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

      I tried checking the URL using parsechecker and found no issues there! I think this means the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to work around this problem, since parsechecker seems to go on its merry way parsing?
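
      The paths in those warnings contain the escape sequence "%3M", which is not a valid percent-encoding because 'M' is not a hexadecimal digit (presumably the intended escape was "%3A", i.e. ':'), so a strict URL decoder rejects the whole rule. A minimal, self-contained sketch reproducing the failure with java.net.URLDecoder follows; it is illustrative only, not Nutch's actual RobotRulesParser code, and the class name is made up:

      import java.net.URLDecoder;

      public class RobotsPathDecodeDemo {
          public static void main(String[] args) throws Exception {
              // Path copied from the warning above. "%3M" is an invalid
              // percent-escape: 'M' is not a hexadecimal digit.
              String path = "/wiki/Wikipedia%3Mediation_Committee/";
              try {
                  System.out.println(URLDecoder.decode(path, "UTF-8"));
              } catch (IllegalArgumentException e) {
                  // A strict decoder fails here, which matches the
                  // "can't decode path" warning in hadoop.log.
                  System.out.println("can't decode path: " + path);
              }
          }
      }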

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Arijit Mukherjee
            Votes: 0
            Watchers: 4
