Description
The fetcher should optionally (enabled by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following Google's spec:
5xx (server error)
Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined.
See also the draft robots.txt RFC, section "Unreachable status".
The crawler-commons robots rules already provide isDeferVisits() to store this information (it must be set from the RobotRulesParser).
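A minimal sketch of how the two sides could fit together, assuming crawler-commons' SimpleRobotRules / RobotRulesMode API; the method names, the FetchQueue interface and the defer interval handling below are illustrative assumptions, not existing crawler-commons or fetcher API:

{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RobotsDeferSketch {

    /**
     * Hypothetical parser-side handling of a failed robots.txt fetch:
     * a 5xx response results in a "full disallow" and flags the host
     * for deferred visits, as suggested by Google's spec.
     */
    public static BaseRobotRules rulesForFetchError(int httpStatus) {
        if (httpStatus >= 500 && httpStatus < 600) {
            SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
            rules.setDeferVisits(true);
            return rules;
        }
        // Other failure modes (e.g. 404) are out of scope here; most crawlers
        // treat a missing robots.txt as "allow all".
        return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }

    /**
     * Hypothetical fetcher-side check: if the rules signal deferred visits,
     * suspend the host's fetch queue for a configurable interval.
     */
    public static void maybeDefer(BaseRobotRules rules, long deferIntervalMs,
            FetchQueue hostQueue) {
        if (rules.isDeferVisits()) {
            hostQueue.suspendUntil(System.currentTimeMillis() + deferIntervalMs);
        }
    }

    /** Minimal stand-in for a per-host fetch queue (assumption, not a real API). */
    interface FetchQueue {
        void suspendUntil(long epochMillis);
    }
}
{code}

The defer interval would be read from a configuration property so that the suspension can be tuned (or disabled) per crawl.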