Nutch / NUTCH-2990

HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.19
    • Fix Version/s: 1.20
    • Component/s: protocol, robots
    • Labels: None

    Description

      The robots.txt parser (HttpRobotRulesParser) follows only one redirect when fetching a robots.txt file, whereas RFC 9309, the robots.txt specification, says that crawlers SHOULD follow at least five consecutive redirects:

      2.3.1.2. Redirects

      It's possible that a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).
      If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority. If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt file is unavailable.
      (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects)

      While following redirects, the parser should check whether the redirect location is itself a "/robots.txt" on a different host and, if so, first try to read that host's rules from the cache.
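
      The redirect handling described above could be sketched roughly as follows. This is a minimal illustration, not the code in Nutch's HttpRobotRulesParser: the class RobotsRedirectSketch, the Response type, the CACHE map, and the fetcher function are all hypothetical names introduced for this example, and the fetch itself is abstracted behind a function so that no particular HTTP client API is assumed.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Nutch code base.
class RobotsRedirectSketch {

  /** RFC 9309: crawlers SHOULD follow at least five consecutive redirects. */
  static final int MAX_REDIRECTS = 5;

  /** Hypothetical minimal response: status code, Location header, body. */
  static final class Response {
    final int status;
    final String location;
    final String body;

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  /** Hypothetical cache of robots.txt content, keyed by "scheme://authority". */
  static final Map<String, String> CACHE = new HashMap<>();

  static String cacheKey(URI u) {
    return u.getScheme() + "://" + u.getAuthority();
  }

  static boolean isRedirect(Response r) {
    int s = r.status;
    return r.location != null
        && (s == 301 || s == 302 || s == 303 || s == 307 || s == 308);
  }

  /**
   * Returns the robots.txt content reached from {@code start}, or empty if it
   * is not reached within MAX_REDIRECTS consecutive redirects. The caller
   * applies the returned rules in the context of the initial authority.
   */
  static Optional<String> fetchRobots(URI start, Function<URI, Response> fetcher) {
    URI current = start;
    for (int followed = 0; ; followed++) {
      Response r = fetcher.apply(current);
      if (r.status == 200) {
        return Optional.ofNullable(r.body);
      }
      if (!isRedirect(r) || followed == MAX_REDIRECTS) {
        return Optional.empty(); // not a redirect, or a sixth redirect
      }
      current = current.resolve(r.location); // Location may be relative
      // If the target is another host's /robots.txt, try the cache first.
      boolean otherHost = !cacheKey(current).equals(cacheKey(start));
      if (otherHost && "/robots.txt".equals(current.getPath())) {
        String cached = CACHE.get(cacheKey(current));
        if (cached != null) {
          return Optional.of(cached);
        }
      }
    }
  }

  public static void main(String[] args) {
    // In-memory stand-in for a web server: one redirect, then the file.
    Map<String, Response> site = new HashMap<>();
    site.put("https://a.example/robots.txt", new Response(301, "/real.txt", null));
    site.put("https://a.example/real.txt", new Response(200, null, "User-agent: *"));
    Function<URI, Response> fetcher =
        u -> site.getOrDefault(u.toString(), new Response(404, null, null));
    System.out.println(fetchRobots(URI.create("https://a.example/robots.txt"), fetcher));
  }
}
```

      In this sketch a sixth consecutive redirect yields an empty result, which corresponds to the RFC's allowance that the crawler MAY treat the robots.txt file as unavailable. How an empty result maps to allow-all or deny-all rules is left to the caller.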

            People

              Assignee: Sebastian Nagel
              Reporter: Sebastian Nagel
              Votes: 0
              Watchers: 3