Description
The attached patch allows following one level of redirection for robots.txt files. A similar issue was mentioned in NUTCH-124 and was marked as fixed a long time ago, but the problem remained, at least when using Fetcher2. Mathijs Homminga pointed out the problem in a mail to the nutch-dev list in March.
I have been using this patch for a while now on a large cluster and noticed that the ratio of robots_denied per fetchlist went up, meaning that we are now honoring restrictions we would have missed before (and getting fewer complaints from webmasters at the same time).
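The behavior the patch adds can be sketched as follows. This is a hypothetical illustration, not the actual Nutch patch: the fake in-memory "server" map and the `RobotsFetcher` class are assumptions for the example. The key point is that exactly one redirect hop is followed; a second redirect is treated as if no robots.txt were found.

```java
import java.util.Map;

// Hypothetical sketch of following ONE level of robots.txt redirection.
public class RobotsFetcher {
    // "server" simulates HTTP responses: a value starting with
    // "REDIRECT:" stands for a 3xx response with that Location,
    // anything else is the robots.txt body.
    static String fetchRobots(Map<String, String> server, String url) {
        String response = server.get(url);
        if (response != null && response.startsWith("REDIRECT:")) {
            // Follow a single redirect hop.
            String target = response.substring("REDIRECT:".length());
            String second = server.get(target);
            if (second != null && second.startsWith("REDIRECT:")) {
                // Give up after one level: treat as missing robots.txt.
                return null;
            }
            return second;
        }
        return response;
    }
}
```

With this logic, a site that redirects `http://a/robots.txt` to `http://b/robots.txt` now has its `Disallow` rules applied to `http://a/`, which is why the robots_denied ratio rises once the patch is in place.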
Attachments
Issue Links
- is related to:
  - NUTCH-2581 Caching of redirected robots.txt may overwrite correct robots.txt rules (Closed)
  - NUTCH-2646 CLONE - Caching of redirected robots.txt may overwrite correct robots.txt rules (Closed)