Nutch / NUTCH-2775

Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: fetcher, robots
    • Labels: None

    Description

      Fetcher uses the number of seconds defined by "fetcher.server.delay" as the delay between successive requests to the same server. The Crawl-Delay directive in robots.txt was meant to let servers request a longer delay. However, I've recently seen a server requesting "Crawl-Delay: 1". That delay is shorter than the default delay, so Nutch may indeed request one page per second. Later this server responds with "HTTP 429 Too Many Requests". Stupid. What about ignoring Crawl-Delay values shorter than the configured default delay or a configurable minimum delay?

      I've already seen the same issue using a different crawler architecture.
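
A minimal sketch of the proposed behavior, for illustration only. It is not taken from the Nutch code base: the class `CrawlDelayPolicy`, the method `effectiveDelayMs`, and the `minCrawlDelayMs` parameter are hypothetical names, and the 5000 ms values are example settings. The idea is that a Crawl-Delay from robots.txt can lengthen the delay but never shorten it below a configured minimum.

```java
// Hypothetical helper, not part of Nutch: clamps a robots.txt Crawl-Delay
// so that it never falls below a configured minimum delay.
final class CrawlDelayPolicy {

    private CrawlDelayPolicy() {}

    /**
     * @param robotsCrawlDelayMs Crawl-Delay from robots.txt in milliseconds,
     *                           or a negative value if the directive is absent
     * @param serverDelayMs      configured default delay ("fetcher.server.delay"), in ms
     * @param minCrawlDelayMs    assumed configurable minimum delay, in ms
     * @return the delay to wait between successive requests to the same server
     */
    static long effectiveDelayMs(long robotsCrawlDelayMs,
                                 long serverDelayMs,
                                 long minCrawlDelayMs) {
        if (robotsCrawlDelayMs < 0) {
            // No Crawl-Delay given: fall back to the configured default.
            return serverDelayMs;
        }
        // Honor longer delays, but ignore values shorter than the minimum.
        return Math.max(robotsCrawlDelayMs, minCrawlDelayMs);
    }

    public static void main(String[] args) {
        long serverDelayMs = 5000;           // example default delay
        long minCrawlDelayMs = serverDelayMs; // minimum defaults to the server delay

        // "Crawl-Delay: 1" is raised to the minimum.
        System.out.println(effectiveDelayMs(1000, serverDelayMs, minCrawlDelayMs));  // 5000
        // "Crawl-Delay: 10" is longer than the minimum and is honored.
        System.out.println(effectiveDelayMs(10000, serverDelayMs, minCrawlDelayMs)); // 10000
        // No Crawl-Delay directive: the default delay applies.
        System.out.println(effectiveDelayMs(-1, serverDelayMs, minCrawlDelayMs));    // 5000
    }
}
```

Making the minimum a separate setting that defaults to the server delay would cover both variants raised in the description: ignoring anything shorter than the default, or allowing operators to pick a different floor.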

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4
