HttpRobotRulesParser caches rules from robots.txt per "protocol:host" (before NUTCH-1031, caching was per "host" only). The caching should be per "protocol:host:port": when in doubt, a request to a different port may deliver a different robots.txt.
Applying robots.txt rules per combination of protocol, host, and port is common practice:
The 1996 norobots Internet-Draft does not state this explicitly (though it can be derived from its examples), but other sources do:
- Wikipedia: "each protocol and port needs its own robots.txt file"
- Google webmasters: "The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted."
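A minimal sketch of the proposed cache key construction (a hypothetical helper, not the actual HttpRobotRulesParser code; names and normalization choices are assumptions). When the URL carries no explicit port, the protocol's default port is used so that "http://example.com/" and "http://example.com:80/" share one cache entry:

```java
import java.net.URL;

public class RobotsCacheKey {

    /**
     * Builds a cache key of the form "protocol:host:port".
     * Falls back to the protocol's default port (e.g. 80 for http,
     * 443 for https) when the URL does not specify one.
     */
    public static String cacheKey(URL url) {
        int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
        return url.getProtocol() + ":" + url.getHost() + ":" + port;
    }

    public static void main(String[] args) throws Exception {
        // Same host on different ports yields different keys,
        // so each port gets its own robots.txt rules.
        System.out.println(cacheKey(new URL("http://example.com/robots.txt")));
        System.out.println(cacheKey(new URL("http://example.com:8080/page")));
    }
}
```

With such a key, rules fetched from "http://example.com:8080/robots.txt" can no longer shadow (or be shadowed by) the rules served on port 80.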