Nutch / NUTCH-2573

Suspend crawling if robots.txt fails to fetch with 5xx status


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.14
    • Fix Version/s: 1.19
    • Component/s: fetcher
    • Labels: None

    Description

      Fetcher should optionally (enabled by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following Google's spec:
      5xx (server error)
      Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined.

      See also the draft robots.txt RFC, section "Unreachable status".

      The crawler-commons robots rules already provide isDeferVisits to store this information (it must be set from Nutch's RobotRulesParser).
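
      A minimal sketch of the intended behaviour, assuming the crawler-commons SimpleRobotRules API with its RobotRulesMode and setDeferVisits/isDeferVisits accessors; the non-5xx fallback and the hard-coded delay are illustrative placeholders, not Nutch's actual configuration keys or fetcher logic:

      import crawlercommons.robots.BaseRobotRules;
      import crawlercommons.robots.SimpleRobotRules;
      import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

      public class RobotsDeferSketch {

        /** Map the HTTP status of a failed robots.txt fetch to robot rules,
         *  marking 5xx responses as "defer visits" (temporary full disallow). */
        public static BaseRobotRules rulesFromStatus(int httpStatus) {
          if (httpStatus >= 500 && httpStatus < 600) {
            // server error: disallow everything for now and flag the host
            // so the fetcher retries later instead of giving up on it
            SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
            rules.setDeferVisits(true);
            return rules;
          }
          // simplified: treat any other failure as "allow all"
          return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }

        public static void main(String[] args) throws InterruptedException {
          long deferIntervalMs = 10_000L; // illustrative; would come from a config property
          BaseRobotRules rules = rulesFromStatus(503);
          if (rules.isDeferVisits()) {
            // fetcher-side reaction: suspend this host's queue for the interval
            System.out.println("robots.txt unavailable (5xx), deferring fetches for "
                + deferIntervalMs + " ms");
            Thread.sleep(deferIntervalMs);
          }
        }
      }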

    Attachments

    Issue Links

    Activity

    People

        Assignee: Sebastian Nagel (snagel)
        Reporter: Sebastian Nagel (snagel)
        Votes: 0
        Watchers: 5

    Dates

        Created:
        Updated:
        Resolved: