Nutch / NUTCH-2646

CLONE - Caching of redirected robots.txt may overwrite correct robots.txt rules


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1, 1.14
    • Fix Version/s: None
    • Component/s: fetcher, robots
    • Labels: None
    • Patch Info: Patch Available

    Description

      When a robots.txt request is redirected, the fetched rules are also cached for the redirect target host. This can mean the correct robots.txt rules for that host are never fetched. E.g., http://wyomingtheband.com/robots.txt redirects to https://www.facebook.com/wyomingtheband/robots.txt. Because the fetch fails with a 404, bots are allowed to crawl wyomingtheband.com. These rules are erroneously also cached for the redirect target host www.facebook.com, which is unambiguous in its robots.txt and does not allow crawling.

      Nutch should cache redirected robots.txt rules for the redirect target host only if the path part of the redirect target URL (and, to be safe, its query) is exactly /robots.txt.
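
      A minimal sketch of the proposed check, as a hypothetical standalone helper (class and method names are illustrative, not part of Nutch's actual robots.txt handling code):

      import java.net.MalformedURLException;
      import java.net.URL;

      /** Hypothetical helper for deciding whether robots.txt rules obtained
       *  via a redirect may also be cached under the redirect target host. */
      public class RobotsRedirectCacheCheck {

          /**
           * Returns true only if the redirect target URL points exactly at
           * /robots.txt with no query string, so that caching the rules for
           * the target host cannot overwrite that host's real robots.txt.
           */
          public static boolean isCacheableForTargetHost(String redirectTargetUrl) {
              try {
                  URL url = new URL(redirectTargetUrl);
                  return "/robots.txt".equals(url.getPath()) && url.getQuery() == null;
              } catch (MalformedURLException e) {
                  return false;
              }
          }

          public static void main(String[] args) {
              // Redirect target from the example above: the path is
              // /wyomingtheband/robots.txt, so the rules must not be cached
              // for www.facebook.com.
              System.out.println(isCacheableForTargetHost(
                  "https://www.facebook.com/wyomingtheband/robots.txt")); // false
              // A redirect to the canonical location is safe to cache.
              System.out.println(isCacheableForTargetHost(
                  "https://www.example.com/robots.txt")); // true
          }
      }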

People

    Assignee: Sebastian Nagel (snagel)
    Reporter: Chang Fan (iFancy)