[NUTCH-1752] cache robots.txt rules per protocol:host:port


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      HttpRobotRulesParser caches the rules from robots.txt per "protocol:host" (before NUTCH-1031, caching was per "host" only). The caching should be per "protocol:host:port": when in doubt, assume that a request to a different port may deliver a different robots.txt (see the sketch after the list below).
      Applying robots.txt rules per combination of protocol, host, and port is common practice. The 1996 Norobots RFC draft does not mention this explicitly (though it could be derived from its examples), but other sources do:

      • Wikipedia: "each protocol and port needs its own robots.txt file"
      • Google webmasters: "The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted."
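
      A minimal sketch of the idea, with illustrative names (RobotsCacheKey and getCacheKey are hypothetical, not the committed code; in Nutch the change belongs in HttpRobotRulesParser):

        import java.net.URL;

        /**
         * Illustrative sketch for NUTCH-1752 (not the committed patch):
         * build the robots.txt cache key from protocol, host, AND port so
         * that rules fetched from http://example.com:8080/robots.txt are
         * never applied to http://example.com/ and vice versa.
         */
        public class RobotsCacheKey {

          /** Compose the cache key as "protocol:host:port". */
          public static String getCacheKey(URL url) {
            String protocol = url.getProtocol().toLowerCase(); // normalize case
            String host = url.getHost().toLowerCase();
            int port = url.getPort();
            if (port == -1) {
              // No explicit port in the URL: fall back to the protocol's
              // default (80 for http, 443 for https) so "http://host/" and
              // "http://host:80/" share one cache entry.
              port = url.getDefaultPort();
            }
            return protocol + ":" + host + ":" + port;
          }

          public static void main(String[] args) throws Exception {
            // Same host, but a different port (or protocol) => different keys
            System.out.println(getCacheKey(new URL("http://example.com/")));      // http:example.com:80
            System.out.println(getCacheKey(new URL("http://example.com:8080/"))); // http:example.com:8080
            System.out.println(getCacheKey(new URL("https://example.com/")));     // https:example.com:443
          }
        }

      Normalizing an implicit port to the protocol default keeps "http://host/" and "http://host:80/" on a single cache entry instead of two.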

      Attachments

        1. NUTCH-1752-v2.patch (3 kB, Sebastian Nagel)
        2. NUTCH-1752-v1.patch (3 kB, Sebastian Nagel)


          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 5
