Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
None
-
None
-
None
Description
A document gone with 404 after db.fetch.interval.max (90 days) has passed
is fetched over and over again but although fetch status is fetch_gone
its status in CrawlDb keeps db_unfetched. Consequently, this document will
be generated and fetched from now on in every cycle.
To reproduce:
- create a CrawlDatum in CrawlDb which retry interval hits db.fetch.interval.max (I manipulated the shouldFetch() in AbstractFetchSchedule to achieve this)
- now this URL is fetched again
- but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days)
- this does not change with every generate-fetch-update cycle, here for two segments:
/tmp/testcrawl/segments/20120105161430 SegmentReader: get 'http://localhost/page_gone' Crawl Generate:: Status: 1 (db_unfetched) Fetch time: Thu Jan 05 16:14:21 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone Crawl Fetch:: Status: 37 (fetch_gone) Fetch time: Thu Jan 05 16:14:48 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone /tmp/testcrawl/segments/20120105161631 SegmentReader: get 'http://localhost/page_gone' Crawl Generate:: Status: 1 (db_unfetched) Fetch time: Thu Jan 05 16:16:23 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone Crawl Fetch:: Status: 37 (fetch_gone) Fetch time: Thu Jan 05 16:20:05 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:
setPageGoneSchedule (called from update / CrawlDbReducer.reduce): datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 if (maxInterval < datum.fetchInterval) // necessarily true forceRefetch() forceRefetch: if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval datum.fetchInterval = 0.9 * maxInterval datum.status = db_unfetched // shouldFetch (called from generate / Generator.map): if ((datum.fetchTime - curTime) > maxInterval) // always true if the crawler is launched in short intervals // (lower than 0.35 * maxInterval) datum.fetchTime = curTime // forces a refetch
After setPageGoneSchedule is called via update the state is db_unfetched and the retry interval 0.9 * db.fetch.interval.max (81 days).
Although the fetch time in the CrawlDb is far in the future
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone URL: http://localhost/page_gone Version: 7 Status: 1 (db_unfetched) Fetch time: Sun May 06 05:20:05 CEST 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Score: 1.0 Signature: null Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max.
The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max.
It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578
Attachments
Attachments
Issue Links
- relates to
-
NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
- Closed