[NUTCH-1752] cache robots.txt rules per protocol:host:port


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      HttpRobotRulesParser caches the rules from robots.txt per "protocol:host" (before NUTCH-1031, caching was per "host" only). The caching should be per "protocol:host:port": when in doubt, assume that a request to a different port may deliver a different robots.txt (see the sketch after the list below).
      Applying robots.txt rules per combination of protocol, host, and port is common practice. The 1996 Norobots RFC draft does not mention this explicitly (though it could be derived from its examples), but other sources do:

      • Wikipedia: "each protocol and port needs its own robots.txt file"
      • Google webmasters: "The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted."
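
      A minimal sketch of the idea, with illustrative names (RobotsCacheKey and getCacheKey are hypothetical, not the committed code; in Nutch the change belongs in HttpRobotRulesParser):

        import java.net.URL;

        /**
         * Illustrative sketch for NUTCH-1752 (not the committed patch):
         * build the robots.txt cache key from protocol, host, AND port so
         * that rules fetched from http://example.com:8080/robots.txt are
         * never applied to http://example.com/ and vice versa.
         */
        public class RobotsCacheKey {

          /** Compose the cache key as "protocol:host:port". */
          public static String getCacheKey(URL url) {
            String protocol = url.getProtocol().toLowerCase(); // normalize case
            String host = url.getHost().toLowerCase();
            int port = url.getPort();
            if (port == -1) {
              // No explicit port in the URL: fall back to the protocol's
              // default (80 for http, 443 for https) so "http://host/" and
              // "http://host:80/" share one cache entry.
              port = url.getDefaultPort();
            }
            return protocol + ":" + host + ":" + port;
          }

          public static void main(String[] args) throws Exception {
            // Same host, but a different port (or protocol) => different keys
            System.out.println(getCacheKey(new URL("http://example.com/")));      // http:example.com:80
            System.out.println(getCacheKey(new URL("http://example.com:8080/"))); // http:example.com:8080
            System.out.println(getCacheKey(new URL("https://example.com/")));     // https:example.com:443
          }
        }

      Normalizing an implicit port to the protocol default keeps "http://host/" and "http://host:80/" on a single cache entry instead of two.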

      Attachments

        1. NUTCH-1752-v2.patch (3 kB, Sebastian Nagel)
        2. NUTCH-1752-v1.patch (3 kB, Sebastian Nagel)


          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 5
