Nutch / NUTCH-2581

Caching of redirected robots.txt may overwrite correct robots.txt rules


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 1.14
    • Fix Version/s: 2.4, 1.15
    • Component/s: fetcher, robots
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Redirected robots.txt rules are also cached for the target host. This may cause the correct robots.txt rules never to be fetched. E.g., http://wyomingtheband.com/robots.txt redirects to https://www.facebook.com/wyomingtheband/robots.txt. Because fetching fails with a 404, bots are allowed to crawl wyomingtheband.com. The rules are erroneously also cached for the redirect target host www.facebook.com, which is explicit in its robots.txt rules and does not allow crawling.

      Nutch should cache redirected robots.txt rules for the target host only if the path part (and, in doubt, the query) of the redirect target URL is exactly /robots.txt.
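      A minimal sketch of the proposed check, assuming a hypothetical helper method (not Nutch's actual implementation): the redirect target URL's path, together with any query string, must be exactly /robots.txt before its rules may be cached for the target host.

      ```java
      import java.net.MalformedURLException;
      import java.net.URL;

      public class RobotsRedirectCheck {
          /**
           * Returns true if the robots.txt rules fetched via a redirect may
           * safely be cached for the redirect target host, i.e. the path part
           * of the target URL (including any query string) is exactly
           * "/robots.txt".
           */
          public static boolean isRobotsTxtTarget(String redirectUrl) {
              try {
                  URL url = new URL(redirectUrl);
                  // getFile() returns the path plus the query string (if any),
                  // so a URL with a query or a different path is rejected
                  return "/robots.txt".equals(url.getFile());
              } catch (MalformedURLException e) {
                  return false;
              }
          }
      }
      ```

      Under this check, the redirect target https://www.facebook.com/wyomingtheband/robots.txt from the example above would not be cached for www.facebook.com, since its path is not exactly /robots.txt.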


      People

      • Assignee: snagel Sebastian Nagel
      • Reporter: snagel Sebastian Nagel
      • Votes: 0
      • Watchers: 4
