Description
If a user-agent name configured in http.robots.agents contains spaces, it is not matched even if it is contained verbatim in the robots.txt:
http.robots.agents = "Download Ninja,*"
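For reference, a minimal sketch of how this property would be set in conf/nutch-site.xml (the value is the one used above; the file name follows the usual Nutch configuration layout):

<property>
  <name>http.robots.agents</name>
  <value>Download Ninja,*</value>
</property>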
If the robots.txt (http://en.wikipedia.org/robots.txt) contains
User-agent: Download Ninja
Disallow: /
all content should be forbidden. But it isn't:
% curl 'http://en.wikipedia.org/robots.txt' > robots.txt
% grep -A1 -i ninja robots.txt
User-agent: Download Ninja
Disallow: /
% cat test.urls
http://en.wikipedia.org/
% bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja'
...
allowed:	http://en.wikipedia.org/
The RFC (http://www.robotstxt.org/norobots-rfc.txt) states:
The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.
Assumed that "Downlaod Ninja" is a substring of itself it should match and http://en.wikipedia.org/ should be forbidden.
The point is that the agent name from the robots.txt User-agent line is split at spaces, while the names from the http.robots.agents property are not (they are only split at ","), so a multi-word name can never match.
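To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the actual RobotRulesParser code; the splitting behaviour is paraphrased from the description above:

import java.util.Arrays;
import java.util.List;

/** Minimal sketch (not the actual Nutch code) of the mismatched splitting. */
public class AgentMatchSketch {

  public static void main(String[] args) {
    // Names from http.robots.agents are split at "," only,
    // so "Download Ninja" survives as one two-word name.
    List<String> configuredAgents =
        Arrays.asList("Download Ninja,*".split(","));

    // The value of the robots.txt User-agent line is additionally
    // split at whitespace -- this is the problematic step.
    String userAgentLine = "Download Ninja";
    String[] robotsTokens = userAgentLine.split("\\s+");

    // "Download Ninja" is only ever compared against "Download" and
    // "Ninja", never against the full two-word name, so nothing matches.
    for (String agent : configuredAgents) {
      for (String token : robotsTokens) {
        System.out.println("'" + agent + "' vs '" + token + "' -> "
            + agent.equalsIgnoreCase(token));
      }
    }

    // What the RFC actually calls for: a substring check against the
    // whole User-agent value, under which "Download Ninja" does match.
    System.out.println("RFC substring match: "
        + userAgentLine.toLowerCase().contains("download ninja"));
  }
}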
Issue Links
- relates to NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Closed)