Description
The robots.txt parser (SimpleRobotRulesParser) of crawler-commons 1.4 (#1085) introduces a new API entry point to parse the robots.txt content:
- it is more efficient: it accepts a collection of lower-cased, single-word user-agent product tokens, so there is no need to tokenize a (comma-separated) list of user-agent strings again for every robots.txt
- user-agent matching is compliant with RFC 9309 (section 2.2.1) only if the new API method is used
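The efficiency argument above can be illustrated with a small, self-contained sketch (the class and method names below are illustrative, not the crawler-commons API): the legacy style re-tokenizes a comma-separated user-agent string for every robots.txt parsed, while the new style receives ready-to-use lower-cased product tokens once and matches them case-insensitively, as RFC 9309 (section 2.2.1) requires.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Illustrative sketch, not the crawler-commons API: shows why passing
// pre-lowercased, single-word product tokens avoids per-robots.txt work.
public class AgentMatchSketch {

    // Legacy-style input: a comma-separated list of user-agent strings
    // that must be split and normalized again for every robots.txt.
    static Collection<String> tokenize(String commaSeparatedAgents) {
        return Arrays.stream(commaSeparatedAgents.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // RFC 9309, section 2.2.1: the product token of a "User-agent:" line
    // is matched case-insensitively against the crawler's product tokens.
    static boolean matches(Collection<String> lowerCaseTokens, String userAgentLine) {
        return lowerCaseTokens.contains(userAgentLine.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Tokenize once (new API style), then reuse the token collection
        // for every robots.txt instead of re-splitting the string.
        Collection<String> tokens = tokenize("MyCrawler, Nutch");
        System.out.println(matches(tokens, "nutch"));      // true
        System.out.println(matches(tokens, "MYCRAWLER"));  // true
        System.out.println(matches(tokens, "otherbot"));   // false
    }
}
```

The point of the new entry point is exactly this split of responsibilities: the caller normalizes its product tokens once, and the parser only performs the case-insensitive lookup per user-agent line.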
Issue Links
- Dependency: NUTCH-2995 Upgrade to crawler-commons 1.4 (Closed)