NUTCH-1752: cache robots.txt rules per protocol:host:port

Details

• Type: Bug
• Status: Resolved
• Priority: Major
• Resolution: Fixed
• Affects Version/s: 1.8, 2.2.1
• Fix Version/s: 2.3, 1.9
• Component/s: protocol
• Labels: None
• Patch Info: Patch Available

Description

HttpRobotRulesParser caches rules from robots.txt per "protocol:host" (before NUTCH-1031, caching was per "host" only). The caching should be per "protocol:host:port": when in doubt, a request to a different port may deliver a different robots.txt. A sketch of such a cache key follows the quotes below.

Applying robots.txt rules to a combination of protocol, host, and port is common practice. The 1996 norobots RFC draft does not mention this explicitly (though it could be derived from its examples), but other sources do:

• Wikipedia: "each protocol and port needs its own robots.txt file"
• Google webmasters: "The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted."
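
For illustration only, a minimal sketch of such a per-"protocol:host:port" cache key built from java.net.URL: it lower-cases protocol and host, and falls back to the protocol's default port when the URL does not name one. The class name RobotsCacheKey is hypothetical, not the patch's actual code:

    import java.net.URL;

    public class RobotsCacheKey {

      // Build the cache key from protocol, host, and port. An unspecified
      // port (-1) falls back to the protocol's default, e.g. 80 for http
      // and 443 for https.
      public static String getCacheKey(URL url) {
        String protocol = url.getProtocol().toLowerCase();
        String host = url.getHost().toLowerCase();
        int port = url.getPort();
        if (port == -1) {
          port = url.getDefaultPort();
        }
        return protocol + ":" + host + ":" + port;
      }

      public static void main(String[] args) throws Exception {
        // Same host, but a different port or protocol yields a different
        // key, so robots.txt is fetched and cached separately.
        System.out.println(getCacheKey(new URL("http://example.com/a")));      // http:example.com:80
        System.out.println(getCacheKey(new URL("http://example.com:8080/a"))); // http:example.com:8080
        System.out.println(getCacheKey(new URL("https://example.com/a")));     // https:example.com:443
      }
    }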

Attachments

1. NUTCH-1752-v2.patch (3 kB, Sebastian Nagel)
2. NUTCH-1752-v1.patch (3 kB, Sebastian Nagel)

People

• Assignee: Sebastian Nagel (snagel)
• Reporter: Sebastian Nagel (snagel)
• Votes: 0
• Watchers: 4
