Description
The fetcher should optionally (enabled by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following Google's spec:
5xx (server error)
Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined.
See also the draft robots.txt RFC, section "Unreachable status".
The crawler-commons robots rules already provide isDeferVisits() to store this information (it must be set from the RobotRulesParser).
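A minimal sketch of how the two sides could fit together, assuming crawler-commons' SimpleRobotRules / RobotRulesMode API; the method names, the FetchQueue interface and the defer interval handling below are illustrative assumptions, not existing crawler-commons or fetcher API:

{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RobotsDeferSketch {

    /**
     * Hypothetical parser-side handling of a failed robots.txt fetch:
     * a 5xx response results in a "full disallow" and flags the host
     * for deferred visits, as suggested by Google's spec.
     */
    public static BaseRobotRules rulesForFetchError(int httpStatus) {
        if (httpStatus >= 500 && httpStatus < 600) {
            SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
            rules.setDeferVisits(true);
            return rules;
        }
        // Other failure modes (e.g. 404) are out of scope here; most crawlers
        // treat a missing robots.txt as "allow all".
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }

    /**
     * Hypothetical fetcher-side check: if the rules signal deferred visits,
     * suspend the host's fetch queue for a configurable interval.
     */
    public static void maybeDefer(BaseRobotRules rules, long deferIntervalMs,
            FetchQueue hostQueue) {
        if (rules.isDeferVisits()) {
            hostQueue.suspendUntil(System.currentTimeMillis() + deferIntervalMs);
        }
    }

    /** Minimal stand-in for a per-host fetch queue (assumption, not a real API). */
    interface FetchQueue {
        void suspendUntil(long epochMillis);
    }
}
{code}

The defer interval would be read from a configuration property so that the suspension can be tuned (or disabled) per crawl.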