Description
In a long-running continuous crawl with Nutch 1.4, URLs that later turn into an HTTP redirect (http.redirect.max = 0) are kept as db_notmodified in the CrawlDb. Short log (cf. attached dumped segment / CrawlDb data):
2012-02-23 : injected
2012-02-24 : fetched
2012-03-30 : re-fetched, signature changed
2012-04-20 : re-fetched, redirected
2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content!
The signature of a previously fetched document is not reset when the URL/doc changes to a redirect at a later time. CrawlDbReducer.reduce then sets the status to db_notmodified because the signature that comes in with the new fetch status is identical to the old one.
Possible fixes (??):
- reset the signature in Fetcher
- handle this case in CrawlDbReducer.reduce
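To make the two options concrete, here is a minimal, self-contained sketch of the status decision. It is a simplified model and not Nutch's actual CrawlDbReducer or Fetcher code: the class RedirectSignatureSketch, its enums, and all method names are hypothetical, and signatures are modelled as plain byte arrays.

```java
import java.util.Arrays;

/** Simplified model of the status decision; NOT Nutch's real CrawlDbReducer. */
class RedirectSignatureSketch {

    // Hypothetical stand-ins for the fetch_* and db_* status values.
    enum FetchStatus { SUCCESS, REDIR_TEMP, REDIR_PERM }
    enum DbStatus { FETCHED, NOTMODIFIED, REDIR_TEMP, REDIR_PERM }

    /** Maps a fetch status to the CrawlDb status it would normally produce. */
    static DbStatus fromFetch(FetchStatus fetch) {
        switch (fetch) {
            case REDIR_TEMP: return DbStatus.REDIR_TEMP;
            case REDIR_PERM: return DbStatus.REDIR_PERM;
            default:         return DbStatus.FETCHED;
        }
    }

    /** Behaviour as reported: equal signatures win, whatever the fetch status is. */
    static DbStatus reduceAsReported(byte[] oldSig, byte[] incomingSig, FetchStatus fetch) {
        if (oldSig != null && Arrays.equals(oldSig, incomingSig)) {
            return DbStatus.NOTMODIFIED;
        }
        return fromFetch(fetch);
    }

    /** Sketch of "handle this case in reduce": redirects skip the signature comparison. */
    static DbStatus reduceRedirectAware(byte[] oldSig, byte[] incomingSig, FetchStatus fetch) {
        if (fetch != FetchStatus.SUCCESS) {
            return fromFetch(fetch);
        }
        return reduceAsReported(oldSig, incomingSig, fetch);
    }

    /** Sketch of "reset the signature in Fetcher": a redirect carries no signature. */
    static byte[] signatureAfterFetch(byte[] staleSig, FetchStatus fetch) {
        return fetch == FetchStatus.SUCCESS ? staleSig : null;
    }

    public static void main(String[] args) {
        byte[] oldSig = {1, 2, 3};       // signature stored by the earlier fetch
        byte[] stale = oldSig.clone();   // same signature arriving with the redirect

        System.out.println("as reported:        "
            + reduceAsReported(oldSig, stale, FetchStatus.REDIR_TEMP));
        System.out.println("fix in reduce:      "
            + reduceRedirectAware(oldSig, stale, FetchStatus.REDIR_TEMP));
        System.out.println("fix in fetcher:     "
            + reduceAsReported(oldSig,
                  signatureAfterFetch(stale, FetchStatus.REDIR_TEMP),
                  FetchStatus.REDIR_TEMP));
    }
}
```

In this model either change is enough for the redirect to end up with a redirect status instead of db_notmodified. Whether the real fix belongs in Fetcher, in CrawlDbReducer.reduce, or in both is left open here.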
Attachments
Issue Links
- is related to NUTCH-1502 Test for CrawlDatum state transitions (Closed)