The aim of
NUTCH-1927 was to make it possible to ignore the robots.txt for a defined set of hosts/domains. Ignoring the robots.txt entirely has some side effects which should be documented:
- undesired content (duplicates, private pages, etc.) may get indexed
- the Crawl-Delay directive is ignored
- no sitemaps are detected (cf.
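As a hedged sketch of how such an allowlist might be configured: NUTCH-1927 added a property for listing hosts whose robots.txt is ignored. Assuming the property name `http.robot.rules.whitelist` (as introduced by that issue) and the example host below, the configuration in `conf/nutch-site.xml` would look like:

```xml
<!-- Sketch of a nutch-site.xml entry; the host value is a placeholder. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated list of hosts/domains for which robots.txt is ignored. -->
  <value>example.com</value>
</property>
```

Only hosts explicitly listed here are affected; for all other hosts the robots.txt rules continue to apply, so the side effects above are limited to the allowlisted set.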