Description
If a user-agent name configured in http.robots.agents contains spaces, it is not matched even if it is contained verbatim in the robots.txt:
http.robots.agents = "Download Ninja,*"
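For reference, a minimal sketch of how this property would be set in conf/nutch-site.xml (the value is the one used above; the file name follows the usual Nutch configuration layout):

<property>
  <name>http.robots.agents</name>
  <value>Download Ninja,*</value>
</property>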
If the robots.txt (http://en.wikipedia.org/robots.txt) contains
User-agent: Download Ninja
Disallow: /
all content should be forbidden. But it isn't:
% curl 'http://en.wikipedia.org/robots.txt' > robots.txt
% grep -A1 -i ninja robots.txt
User-agent: Download Ninja
Disallow: /
% cat test.urls
http://en.wikipedia.org/
% bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
...
allowed:	http://en.wikipedia.org/
The RFC (http://www.robotstxt.org/norobots-rfc.txt) states:
The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
Assumed that "Downlaod Ninja" is a substring of itself it should match and http://en.wikipedia.org/ should be forbidden.
The point is that the agent name from the robots.txt User-agent line is split at spaces, while the names from the http.robots.agents property are not (they are only split at ","), so a multi-word name can never match.
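To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the actual RobotRulesParser code; the splitting behaviour is paraphrased from the description above:

import java.util.Arrays;
import java.util.List;

/** Minimal sketch (not the actual Nutch code) of the mismatched splitting. */
public class AgentMatchSketch {

  public static void main(String[] args) {
    // Names from http.robots.agents are split at "," only,
    // so "Download Ninja" survives as one two-word name.
    List<String> configuredAgents =
        Arrays.asList("Download Ninja,*".split(","));

    // The value of the robots.txt User-agent line is additionally
    // split at whitespace -- this is the problematic step.
    String userAgentLine = "Download Ninja";
    String[] robotsTokens = userAgentLine.split("\\s+");

    // "Download Ninja" is only ever compared against "Download" and
    // "Ninja", never against the full two-word name, so nothing matches.
    for (String agent : configuredAgents) {
      for (String token : robotsTokens) {
        System.out.println("'" + agent + "' vs '" + token + "' -> "
            + agent.equalsIgnoreCase(token));
      }
    }

    // What the RFC actually calls for: a substring check against the
    // whole User-agent value, under which "Download Ninja" does match.
    System.out.println("RFC substring match: "
        + userAgentLine.toLowerCase().contains("download ninja"));
  }
}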
Issue Links
- relates to NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Closed)