
NUTCH-1455: RobotRulesParser to match multi-word user-agent names


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.7, 2.2
    • Component/s: protocol
    • Labels: None

    Description

      If a user-agent name configured in http.robots.agents contains spaces, it is not matched even if it is contained verbatim in the robots.txt:

      http.robots.agents = "Download Ninja,*"
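
      For reference, the equivalent entry in conf/nutch-site.xml would look like the sketch below (the property name and value are taken from this report; the surrounding XML layout is the standard Nutch configuration format):

      <property>
        <name>http.robots.agents</name>
        <value>Download Ninja,*</value>
      </property>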

      If the robots.txt (http://en.wikipedia.org/robots.txt) contains

      User-agent: Download Ninja
      Disallow: /
      

      all content should be forbidden. But it isn't:

      % curl 'http://en.wikipedia.org/robots.txt' > robots.txt
      % grep -A1 -i ninja robots.txt 
      User-agent: Download Ninja
      Disallow: /
      % cat test.urls
      http://en.wikipedia.org/
      % bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
      ...
      allowed:        http://en.wikipedia.org/
      

      The RFC (http://www.robotstxt.org/norobots-rfc.txt) states:

      The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

      Given that "Download Ninja" is a substring of itself, it should match, and http://en.wikipedia.org/ should be forbidden.

      The root cause is that the agent name from the User-Agent line is split at spaces, while the names from the http.robots.agents property are only split at ",".
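
      A minimal Java sketch of the intended behavior, assuming the fix keeps multi-word names intact: the http.robots.agents value is split at "," only, and each configured name is matched against a record's User-Agent value by case-insensitive substring containment, per the RFC. The class and method names here (AgentNameMatcher, parseConfiguredAgents, matches) are hypothetical, not the actual Nutch patch.

      import java.util.ArrayList;
      import java.util.List;

      public class AgentNameMatcher {

          // Split the configured http.robots.agents value at "," only,
          // never at spaces, so multi-word names stay intact.
          static List<String> parseConfiguredAgents(String httpRobotsAgents) {
              List<String> agents = new ArrayList<String>();
              for (String name : httpRobotsAgents.split(",")) {
                  agents.add(name.trim().toLowerCase());
              }
              return agents;
          }

          // A record applies if its User-Agent value contains one of our
          // name tokens as a substring (per the RFC), or if "*" is configured.
          static boolean matches(String userAgentValue, List<String> agents) {
              String value = userAgentValue.trim().toLowerCase();
              for (String agent : agents) {
                  if ("*".equals(agent) || value.contains(agent)) {
                      return true;
                  }
              }
              return false;
          }

          public static void main(String[] args) {
              List<String> agents = parseConfiguredAgents("Download Ninja,*");
              // Prints "true": "download ninja" is a substring of itself.
              // Splitting the User-Agent value at spaces (the buggy behavior)
              // would yield "download" and "ninja", neither of which equals
              // the configured multi-word name.
              System.out.println(matches("Download Ninja", agents));
          }
      }

      With the buggy split-at-spaces behavior, the record for "Download Ninja" never matches, which is what the RobotRulesParser run above demonstrates.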


            People

              Assignee: Tejas Patil (tejasp)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 3
