Nutch / NUTCH-1418

error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.7, 2.2
    • Component/s: None
    • Labels: None

    Description

      Since learning that Nutch cannot crawl JavaScript function calls in href attributes, I started looking for alternatives and decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
      I first injected this URL and followed the step-by-step approach up to the fetcher, at which point I realized that Nutch had not fetched anything from this website. Looking into logs/hadoop.log, I found the following three lines, which I believe could mean that Nutch is unable to parse the website's robots.txt and that the fetcher therefore stopped:

      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
      2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

      I checked the URL with parsechecker and there were no issues there. I think this means that the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to get around this problem? parsechecker seems to go on its merry way parsing.
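
      To illustrate why these paths cannot be decoded, here is a minimal sketch. It assumes the robots.txt parser percent-decodes each path with java.net.URLDecoder, which this report does not confirm. The class name RobotsPathDecodeSketch, the helper tryDecode, and the well-formed "%3A" variant of the path are hypothetical and used only for comparison. In "%3M" the character "M" is not a hexadecimal digit, so the escape sequence is invalid.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch, not Nutch code: shows how a malformed percent-escape
// such as "%3M" fails to decode while a well-formed one such as "%3A" succeeds.
public class RobotsPathDecodeSketch {

    // Returns the decoded path, or null when the percent-encoding is malformed.
    static String tryDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // URLDecoder throws IllegalArgumentException for an escape
            // whose two following characters are not both hex digits.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            "/wiki/Wikipedia%3Mediation_Committee/",  // path from the log above
            "/wiki/Wikipedia%3AMediation_Committee/"  // assumed well-formed variant
        };
        for (String path : paths) {
            String decoded = tryDecode(path);
            System.out.println(path + " -> "
                + (decoded == null ? "cannot decode" : decoded));
        }
    }
}
```

      In this sketch a path that fails to decode only yields null for that one rule, so a caller could skip it and continue with the remaining rules. Whether the fetcher behaves this way is not established by the log lines above.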

          People

            Assignee: Unassigned
            Reporter: Arijit Mukherjee (parijip)
            Votes: 0
            Watchers: 4
