Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 1.4
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
The Bixo project has an improved version of Nutch's robots.txt parsing code.
This was recently contributed to crawler-commons, in a format that should be independent of Bixo, Cascading, and even Hadoop.
Nutch could switch to this and benefit from more robust parsing, better compliance with ad hoc extensions to the Robots Exclusion Protocol, and a wider community of users and developers for that code.
Issue Links
- duplicates NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (Closed)