Nutch / NUTCH-2581

Caching of redirected robots.txt may overwrite correct robots.txt rules


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 1.14
    • Fix Version/s: 2.4, 1.15
    • Component/s: fetcher, robots
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Redirected robots.txt rules are also cached for the target host. This may cause the correct robots.txt rules never to be fetched. E.g., http://wyomingtheband.com/robots.txt redirects to https://www.facebook.com/wyomingtheband/robots.txt. Because fetching fails with a 404, bots are allowed to crawl wyomingtheband.com. The rules are erroneously also cached for the redirect target host www.facebook.com, which is explicit in its robots.txt rules and does not allow crawling.

      Nutch should cache redirected robots.txt rules for the target host only if the path part (and, in doubt, the query) of the redirect target URL is exactly /robots.txt.
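      A minimal sketch of the proposed check, assuming a hypothetical helper method (not Nutch's actual implementation): the redirect target URL's path, together with any query string, must be exactly /robots.txt before its rules may be cached for the target host.

      ```java
      import java.net.MalformedURLException;
      import java.net.URL;

      public class RobotsRedirectCheck {
          /**
           * Returns true if the robots.txt rules fetched via a redirect may
           * safely be cached for the redirect target host, i.e. the path part
           * of the target URL (including any query string) is exactly
           * "/robots.txt".
           */
          public static boolean isRobotsTxtTarget(String redirectUrl) {
              try {
                  URL url = new URL(redirectUrl);
                  // getFile() returns the path plus the query string (if any),
                  // so a URL with a query or a different path is rejected
                  return "/robots.txt".equals(url.getFile());
              } catch (MalformedURLException e) {
                  return false;
              }
          }
      }
      ```

      Under this check, the redirect target https://www.facebook.com/wyomingtheband/robots.txt from the example above would not be cached for www.facebook.com, since its path is not exactly /robots.txt.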


      People

      • Assignee: snagel Sebastian Nagel
      • Reporter: snagel Sebastian Nagel
      • Votes: 0
      • Watchers: 4
