Nutch / NUTCH-1716

RobotRulesParser adds extra '*' to the robots name

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.7, 2.2.1
    • Fix Version/s: 2.3, 1.8
    • Component/s: fetcher
    • Labels: None

    Description

      In RobotRulesParser, when Nutch creates an agent string from multiple agents, it combines the agents from both 'http.agent.name' and 'http.robots.agents' and then appends a wildcard (i.e. *) to the end. This string is passed to crawler-commons when parsing the rules. If a (User-agent: *) group appears in the robots file before any other matching group, the wildcard is matched first, so a URL that should be allowed is reported as denied by robots.
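
      The effect described above can be illustrated with a small, self-contained sketch. This is not the Nutch or crawler-commons code: the class RobotsWildcardSketch, its methods, the agent names "mybot" and "mybot-alt", and the sample robots file are all hypothetical, and the matcher is a deliberately naive first-match parser that treats '*' as just another name in the agent list. It only assumes, as the report states, that the first matching User-agent group decides the outcome.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RobotsWildcardSketch {

    /** Joins the agent names into one comma-separated string, optionally appending "*". */
    public static String buildAgentString(String agentName, String robotsAgents,
                                          boolean appendWildcard) {
        List<String> agents = new ArrayList<>();
        agents.add(agentName.trim());
        for (String a : robotsAgents.split(",")) {
            String t = a.trim();
            if (!t.isEmpty() && !agents.contains(t)) {
                agents.add(t);
            }
        }
        if (appendWildcard) {
            agents.add("*"); // the extra wildcard this issue is about
        }
        return String.join(",", agents);
    }

    /**
     * Toy first-match check (assumption: one User-agent line per group).
     * The first group whose name appears in agentString decides the result.
     */
    public static boolean isAllowed(String robotsTxt, String agentString, String path) {
        List<String> names =
            Arrays.asList(agentString.toLowerCase(Locale.ROOT).split(","));
        boolean inMatchedGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (inMatchedGroup) {
                    break; // the first matching group has ended; later groups are ignored
                }
                inMatchedGroup = names.contains(value.toLowerCase(Locale.ROOT));
            } else if (inMatchedGroup && field.equals("disallow")
                       && !value.isEmpty() && path.startsWith(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt =
            "User-agent: *\n" +
            "Disallow: /\n" +
            "\n" +
            "User-agent: mybot\n" +
            "Disallow: /private\n";

        String withStar = buildAgentString("mybot", "mybot,mybot-alt", true);
        String withoutStar = buildAgentString("mybot", "mybot,mybot-alt", false);

        // With the trailing "*", the leading (User-agent: *) group matches first.
        System.out.println(withStar + " -> " + isAllowed(robotsTxt, withStar, "/page.html"));
        // Without it, the agent-specific group is the first match.
        System.out.println(withoutStar + " -> " + isAllowed(robotsTxt, withoutStar, "/page.html"));
    }
}
```

      In this sketch, /page.html is denied when the agent string ends in '*' and allowed when it does not, even though the mybot-specific group only disallows /private. A real parser normally falls back to the (User-agent: *) group on its own when no named agent matches, which is why the appended wildcard is unnecessary.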

      This bug was reported by @Markus Jelsma. The discussion on the nutch-user mailing list can be found here:
      http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E

          People

            Assignee: Tejas Patil (tejasp)
            Reporter: Tejas Patil (tejasp)
            Votes: 0
            Watchers: 1