[NUTCH-1564] AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.6, 2.1
Fix Version/s: None
Component/s: crawldb
Labels:
None

Description

In a continuous crawl with adaptive fetch scheduling, documents that have not been modified for a long time may be refetched in every cycle.

A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):

db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.fetch.schedule.adaptive.sync_delta   = true (default)
db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
db.fetch.interval.default               = 172800 (2 days)
db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
db.fetch.interval.max                   = 604800 (7 days)

On Apr 18 a URL is generated and fetched (from the segment dump):

Crawl Generate::
Status: 2 (db_fetched)
Fetch time: Mon Apr 15 19:43:22 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Crawl Fetch::
Status: 33 (fetch_success)
Fetch time: Thu Apr 18 01:23:51 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

Running the CrawlDb update then yields a next fetch time in the past, which forces an immediate refetch in the next cycle:

Status: 6 (db_notmodified)
Fetch time: Tue Apr 16 01:37:00 CEST 2013
Modified time: Tue Mar 19 01:07:42 CET 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)

This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:

  if (SYNC_DELTA) {
    // try to synchronize with the time of change
    long delta = (fetchTime - modifiedTime) / 1000L;
    if (delta > interval) interval = delta;
    refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
  }
  if (interval < MIN_INTERVAL) {
    interval = MIN_INTERVAL;
  } else if (interval > MAX_INTERVAL) {
    interval = MAX_INTERVAL;
  }
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));

delta is 30 days (Apr 18 - Mar 19), so refTime lies 9 days in the past (delta * 0.3). Because delta exceeds MAX_INTERVAL, interval is clamped to MAX_INTERVAL = 7 days. Adding this interval to refTime puts the next fetch 2 days in the past (Apr 16).
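The arithmetic above can be replayed with the timestamps from the segment dump. The following is only a sketch that mirrors the quoted snippet, not the actual Nutch class: the class name SyncDeltaExample, the method nextFetchTime and the constants are made up for illustration, the interval is held in a double, and the timestamps are encoded with fixed UTC offsets (+02:00 for CEST, +01:00 for CET).

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class SyncDeltaExample {
    // Intervals from the configuration listed in this issue (seconds).
    static final long MIN_INTERVAL = 86400L;
    static final long MAX_INTERVAL = 604800L;
    static final double SYNC_DELTA_RATE = 0.3;

    // Timestamps from the segment dump, as epoch milliseconds.
    static final long FETCH_TIME = OffsetDateTime
        .of(2013, 4, 18, 1, 23, 51, 0, ZoneOffset.ofHours(2)).toInstant().toEpochMilli();
    static final long MODIFIED_TIME = OffsetDateTime
        .of(2013, 3, 19, 1, 7, 42, 0, ZoneOffset.ofHours(1)).toInstant().toEpochMilli();

    /** Hypothetical helper mirroring the quoted sync_delta arithmetic (times in ms, interval in s). */
    static long nextFetchTime(long fetchTime, long modifiedTime, double interval) {
        // try to synchronize with the time of change
        long delta = (fetchTime - modifiedTime) / 1000L;
        if (delta > interval) interval = delta;
        long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
        if (interval < MIN_INTERVAL) {
            interval = MIN_INTERVAL;
        } else if (interval > MAX_INTERVAL) {
            interval = MAX_INTERVAL;
        }
        return refTime + Math.round(interval * 1000.0);
    }

    public static void main(String[] args) {
        long next = nextFetchTime(FETCH_TIME, MODIFIED_TIME, 604800);
        // Show the computed next fetch time in the CEST offset used by the dump.
        System.out.println(Instant.ofEpochMilli(next).atOffset(ZoneOffset.ofHours(2)));
        System.out.println("in the past: " + (next < FETCH_TIME));
    }
}
```

With these inputs the computed next fetch time falls on Apr 16, about two days before the fetch on Apr 18, which agrees with the CrawlDb entry shown above.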

According to the Javadoc (if I understand it correctly), sync_delta has two aims when we know that a document has not been modified for a long time:

increase the fetch interval immediately (not step by step)
shift the "reference time": because we expected the document to change within the adaptive interval (but it has not), we expect a change soon.

These two aims contradict each other to some degree. In any case, the next fetch time should always lie between (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL), and never in the past.
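The range constraint stated above could be enforced by clamping the computed value. This is one possible sketch under stated assumptions, not a proposed patch for AdaptiveFetchSchedule: the class FetchTimeClamp and the method clampNextFetchTime are hypothetical names, times are epoch milliseconds and intervals are seconds.

```java
public class FetchTimeClamp {
    /**
     * Hypothetical helper: forces a candidate next fetch time into
     * [fetchTime + minInterval, fetchTime + maxInterval].
     * Times are epoch milliseconds, intervals are seconds.
     */
    static long clampNextFetchTime(long fetchTime, long candidate,
                                   long minIntervalSec, long maxIntervalSec) {
        long earliest = fetchTime + minIntervalSec * 1000L;
        long latest = fetchTime + maxIntervalSec * 1000L;
        return Math.max(earliest, Math.min(latest, candidate));
    }

    public static void main(String[] args) {
        long fetchTime = 1_000_000_000L;
        // A candidate two days before the current fetch is moved to fetchTime + MIN_INTERVAL.
        long candidate = fetchTime - 2L * 86400L * 1000L;
        long next = clampNextFetchTime(fetchTime, candidate, 86400L, 604800L);
        System.out.println(next - fetchTime); // offset from the current fetch, in ms
    }
}
```

A clamp like this only guarantees the range; it does not decide which of the two sync_delta aims should take precedence.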

This problem has been noted by pascaldimassimo in 1 and 2.

Issue Links

is related to

NUTCH-1502 Test for CrawlDatum state transitions

Closed

People

Assignee: Unassigned
Reporter: Sebastian Nagel
Votes: 0
Watchers: 4

Dates

Created: 19/Apr/13 12:11
Updated: 15/Jul/14 09:25