Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.6, 2.1
None
None
Description
In a continuous crawl with adaptive fetch scheduling, documents that have not been modified for a long time may be fetched in every cycle.
A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):
db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max = 604800 (7 days)
On Apr 18, a URL is generated and fetched (from the segment dump):
Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):
Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
    if (SYNC_DELTA) {
      // try to synchronize with the time of change
      long delta = (fetchTime - modifiedTime) / 1000L;
      if (delta > interval) interval = delta;
      refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
    }
    if (interval < MIN_INTERVAL) {
      interval = MIN_INTERVAL;
    } else if (interval > MAX_INTERVAL) {
      interval = MAX_INTERVAL;
    }
    ...
    datum.setFetchTime(refTime + Math.round(interval * 1000.0));
delta is about 30 days (Apr 18 - Mar 19), so refTime is about 9 days in the past (delta * 0.3). After adding interval (clamped to MAX_INTERVAL = 7 days) to refTime, the next fetch "should" take place 2 days in the past (Apr 16).
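The arithmetic can be checked with a small standalone sketch. It mirrors only the quoted snippet (the elided "..." parts of AdaptiveFetchSchedule are left out), hard-codes the configured values from above, and assumes CET = UTC+1 and CEST = UTC+2 for the timestamps in the dump. The class and method names are made up for illustration and are not part of Nutch.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class SyncDeltaDemo {
    // Values taken from the configuration in this report (seconds).
    static final long MIN_INTERVAL = 86400L;
    static final long MAX_INTERVAL = 604800L;
    static final double SYNC_DELTA_RATE = 0.3;

    /**
     * Mirrors the quoted sync_delta snippet with SYNC_DELTA enabled.
     * fetchTime and modifiedTime are epoch milliseconds, interval is seconds.
     */
    static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
        // try to synchronize with the time of change
        long delta = (fetchTime - modifiedTime) / 1000L;
        if (delta > interval) interval = delta;
        long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
        if (interval < MIN_INTERVAL) {
            interval = MIN_INTERVAL;
        } else if (interval > MAX_INTERVAL) {
            interval = MAX_INTERVAL;
        }
        return refTime + Math.round(interval * 1000.0);
    }

    public static void main(String[] args) {
        // Thu Apr 18 01:23:51 CEST 2013 (assumed UTC+2)
        long fetchTime = OffsetDateTime.of(2013, 4, 18, 1, 23, 51, 0, ZoneOffset.ofHours(2))
                .toInstant().toEpochMilli();
        // Tue Mar 19 01:07:42 CET 2013 (assumed UTC+1)
        long modifiedTime = OffsetDateTime.of(2013, 3, 19, 1, 7, 42, 0, ZoneOffset.ofHours(1))
                .toInstant().toEpochMilli();

        long next = nextFetchTime(fetchTime, modifiedTime, 604800L);
        System.out.println("next fetch (UTC): " + Instant.ofEpochMilli(next));
        System.out.println("offset from fetch time: " + Duration.ofMillis(next - fetchTime));
    }
}
```

Under these assumptions the computed next fetch time lies just under two days before the fetch time, which corresponds to the Apr 16 01:37:00 CEST value shown in the CrawlDb dump.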
According to the javadoc (if I understand it correctly), sync_delta has two aims when we know that a document has not been modified for a long time:
- increase the fetch interval immediately (not step by step)
- because we expect the document to be changed within the adaptive interval (but it hasn't), we shift the "reference time", i.e. we expect a change soon.
These two aims somewhat contradict each other. In any case, the next fetch time should always be within the range of (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL) and never in the past.
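The range constraint stated above could be sketched as a simple clamp applied to whatever next fetch time the schedule proposes. This is only an illustration of the invariant, not a proposed patch: clampNextFetchTime and its parameters are hypothetical names, and how such a check would fit into AdaptiveFetchSchedule is left open.

```java
public class ClampedFetchTime {
    /**
     * Hypothetical helper: forces a proposed next fetch time (epoch ms) into
     * [currentFetchTime + minInterval, currentFetchTime + maxInterval].
     * Intervals are given in seconds, as in the Nutch configuration.
     */
    static long clampNextFetchTime(long proposedNext, long currentFetchTime,
                                   long minIntervalSec, long maxIntervalSec) {
        long earliest = currentFetchTime + minIntervalSec * 1000L;
        long latest = currentFetchTime + maxIntervalSec * 1000L;
        return Math.max(earliest, Math.min(latest, proposedNext));
    }

    public static void main(String[] args) {
        long now = 1_366_240_000_000L;               // arbitrary fetch time in ms
        long proposed = now - 2L * 86400L * 1000L;   // two days in the past
        long next = clampNextFetchTime(proposed, now, 86400L, 604800L);
        // A proposal in the past is moved to currentFetchTime + MIN_INTERVAL.
        System.out.println((next - now) / 1000L + " seconds after the current fetch");
    }
}
```

With such a bound in place, a proposal that falls into the past is replaced by currentFetchTime + MIN_INTERVAL, so the document is not selected again in the very next cycle.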
This problem has been noted by pascaldimassimo in 1 and 2.
Issue Links
- is related to NUTCH-1502 Test for CrawlDatum state transitions (Closed)