Details
Description
Rules obtained via a redirected robots.txt are also cached for the redirect target host. This may prevent the correct robots.txt rules for that host from ever being fetched. E.g., http://wyomingtheband.com/robots.txt redirects to https://www.facebook.com/wyomingtheband/robots.txt. Because the fetch of the redirect target fails with a 404, bots are allowed to crawl wyomingtheband.com. These rules are erroneously also cached for the redirect target host www.facebook.com, whose actual robots.txt is explicit and does not allow crawling.
Nutch should cache redirected robots.txt rules for the redirect target host only if the path part of the redirect target URL (in doubt, including the query) is exactly /robots.txt.
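The proposed condition could be sketched as follows. This is a minimal illustration, not Nutch's actual implementation; the class and method names are hypothetical, and a redirect target with a query string is conservatively treated as not cacheable:

```java
import java.net.URI;

public class RobotsRedirectCheck {

    /**
     * Returns true only if the redirect target URL points exactly to
     * /robots.txt, i.e. the fetched rules may safely be cached for the
     * redirect target host as well. A target with a query string is
     * treated as not matching, to be on the safe side.
     */
    static boolean isRobotsTxtPath(URI redirectTarget) {
        return "/robots.txt".equals(redirectTarget.getPath())
                && redirectTarget.getQuery() == null;
    }

    public static void main(String[] args) {
        // Redirect target from the example above: the path is
        // /wyomingtheband/robots.txt, so the (404-derived) rules must
        // not be cached for www.facebook.com.
        URI facebook = URI.create("https://www.facebook.com/wyomingtheband/robots.txt");
        System.out.println(isRobotsTxtPath(facebook)); // prints "false"

        // A redirect ending at a plain /robots.txt would be cacheable.
        URI plain = URI.create("https://www.example.com/robots.txt");
        System.out.println(isRobotsTxtPath(plain)); // prints "true"
    }
}
```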
Issue Links
- is cloned by NUTCH-2646 CLONE - Caching of redirected robots.txt may overwrite correct robots.txt rules (Closed)
- relates to NUTCH-731 Redirection of robots.txt in RobotRulesParser (Closed)