Details
-
Improvement
-
Status: Closed
-
Minor
-
Resolution: Implemented
Description
(see NUTCH-1927 and NUTCH-2803)
The aim of NUTCH-1927 was to make it possible to ignore the robots.txt for a defined set of hosts/domains. Ignoring the robots.txt entirely has some side effects which should be documented:
- undesired content (duplicates, private pages, etc.) may get indexed
- the Crawl-Delay is ignored
- no sitemaps are detected (cf. NUTCH-2807)
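
To illustrate the last two side effects, here is a minimal Java sketch of the information that a crawler stops seeing once the robots.txt is skipped for a host. It is not Nutch's actual parser: the class RobotsTxtSideEffects, its Directives holder and the example.com URL are hypothetical names chosen for this illustration, and the parsing is deliberately simplified (it takes the first Crawl-delay line regardless of the User-agent group it belongs to).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not part of Nutch.
public class RobotsTxtSideEffects {

    /** Holds the robots.txt directives that are lost when the file is ignored. */
    public static final class Directives {
        public Double crawlDelaySeconds;                       // null if no Crawl-delay line was found
        public final List<String> sitemaps = new ArrayList<>(); // URLs from Sitemap lines
    }

    /** Extracts Crawl-delay and Sitemap directives from the text of a robots.txt file. */
    public static Directives parse(String robotsTxt) {
        Directives d = new Directives();
        for (String raw : robotsTxt.split("\\R")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();

            // Split at the FIRST colon only, because sitemap URLs contain colons too.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (key.equals("crawl-delay") && d.crawlDelaySeconds == null) {
                // Simplification: first Crawl-delay wins, User-agent groups are not tracked.
                try {
                    d.crawlDelaySeconds = Double.parseDouble(value);
                } catch (NumberFormatException e) {
                    // Malformed delay value: treat it as absent.
                }
            } else if (key.equals("sitemap") && !value.isEmpty()) {
                d.sitemaps.add(value);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
            "User-agent: *",
            "Disallow: /private/",
            "Crawl-delay: 5",
            "Sitemap: https://example.com/sitemap.xml");

        Directives d = parse(robotsTxt);
        System.out.println("Crawl-delay lost when ignoring robots.txt: " + d.crawlDelaySeconds);
        System.out.println("Sitemaps lost when ignoring robots.txt: " + d.sitemaps);
    }
}
```

If the robots.txt of a host is never fetched or evaluated, none of these values reach the crawler, so it falls back to its configured default delay and cannot discover sitemaps announced only in that file.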