  1. Nutch
  2. NUTCH-1564

AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.6, 2.1
    • Fix Version/s: None
    • Component/s: crawldb
    • Labels: None

    Description

      In a continuous crawl with adaptive fetch scheduling, documents that have not been modified for a long time may be fetched in every cycle.

      A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):

      db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
      db.fetch.schedule.adaptive.sync_delta   = true (default)
      db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
      db.fetch.interval.default               = 172800 (2 days)
      db.fetch.schedule.adaptive.min_interval =  86400 (1 day)
      db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
      db.fetch.interval.max                   = 604800 (7 days)
      

      On Apr 18, a URL is generated and fetched (from the segment dump):

      Crawl Generate::
      Status: 2 (db_fetched)
      Fetch time: Mon Apr 15 19:43:22 CEST 2013
      Modified time: Tue Mar 19 01:07:42 CET 2013
      Retries since fetch: 0
      Retry interval: 604800 seconds (7 days)
      
      Crawl Fetch::
      Status: 33 (fetch_success)
      Fetch time: Thu Apr 18 01:23:51 CEST 2013
      Modified time: Tue Mar 19 01:07:42 CET 2013
      Retries since fetch: 0
      Retry interval: 604800 seconds (7 days)
      

      Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):

      Status: 6 (db_notmodified)
      Fetch time: Tue Apr 16 01:37:00 CEST 2013
      Modified time: Tue Mar 19 01:07:42 CET 2013
      Retries since fetch: 0
      Retry interval: 604800 seconds (7 days)
      

      This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:

        if (SYNC_DELTA) {
          // try to synchronize with the time of change
          long delta = (fetchTime - modifiedTime) / 1000L;
          if (delta > interval) interval = delta;
          refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
        }
        if (interval < MIN_INTERVAL) {
          interval = MIN_INTERVAL;
        } else if (interval > MAX_INTERVAL) {
          interval = MAX_INTERVAL;
        }
      ...
      datum.setFetchTime(refTime + Math.round(interval * 1000.0));
      

      delta is about 30 days (Apr 18 - Mar 19). refTime is then about 9 days in the past (delta * 0.3). After adding interval (clamped to MAX_INTERVAL = 7 days) to refTime, the next fetch "should" take place about 2 days in the past (Apr 16).
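
      The arithmetic above can be replayed outside Nutch. The following is a minimal sketch, not Nutch code: the class name SyncDeltaReplay and the method nextFetchTime are made up for illustration, the constants are copied from the configuration listed in this report, and the time zone Europe/Berlin is an assumption standing in for CET/CEST. It uses double arithmetic throughout, which may differ slightly from the types used in AdaptiveFetchSchedule itself.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Replays the quoted sync_delta arithmetic with the timestamps from the segment dump. */
class SyncDeltaReplay {
    // Values taken from the configuration listed in this report.
    static final double SYNC_DELTA_RATE = 0.3;
    static final long MIN_INTERVAL = 86400L;   // seconds (1 day)
    static final long MAX_INTERVAL = 604800L;  // seconds (7 days)

    /** Returns the next fetch time in epoch milliseconds; interval is given in seconds. */
    static long nextFetchTime(long fetchTime, long modifiedTime, long interval) {
        // Same steps as the quoted snippet: synchronize with the time of change.
        long delta = (fetchTime - modifiedTime) / 1000L;
        if (delta > interval) interval = delta;
        long refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
        if (interval < MIN_INTERVAL) {
            interval = MIN_INTERVAL;
        } else if (interval > MAX_INTERVAL) {
            interval = MAX_INTERVAL;
        }
        return refTime + Math.round(interval * 1000.0);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Berlin"); // assumed zone for CET/CEST
        long fetchTime = ZonedDateTime.of(2013, 4, 18, 1, 23, 51, 0, zone)
                .toInstant().toEpochMilli();
        long modifiedTime = ZonedDateTime.of(2013, 3, 19, 1, 7, 42, 0, zone)
                .toInstant().toEpochMilli();

        long next = nextFetchTime(fetchTime, modifiedTime, MAX_INTERVAL);
        Duration behind = Duration.ofMillis(fetchTime - next);
        System.out.println("next fetch lies " + behind + " before the current fetch");
    }
}
```

      With these inputs the computed next fetch time falls roughly two days before the fetch time, in line with the Apr 16 value shown in the CrawlDb dump.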

      According to the Javadoc (if I understand it correctly), sync_delta has two aims when a document is known not to have been modified for a long time:

      • increase the fetch interval immediately (not step by step)
      • because we expect the document to be changed within the adaptive interval (but it hasn't), we shift the "reference time", i.e. we expect a change soon.

      These two aims somewhat contradict each other. In any case, the next fetch time should always lie between (currentFetchTime + MIN_INTERVAL) and (currentFetchTime + MAX_INTERVAL), and never in the past.
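
      The range constraint stated above could be enforced with a simple clamp. This is only a sketch of the idea, not a patch for AdaptiveFetchSchedule and not necessarily how the issue should be resolved: the class FetchTimeClamp and the method clampNextFetchTime are hypothetical names, times are assumed to be epoch milliseconds and intervals seconds, as in the quoted snippet.

```java
/** Hypothetical helper: keeps the next fetch time inside the window this report asks for. */
class FetchTimeClamp {
    /**
     * @param fetchTime    current fetch time, epoch milliseconds
     * @param proposedNext next fetch time computed by the schedule, epoch milliseconds
     * @param minInterval  minimum interval in seconds
     * @param maxInterval  maximum interval in seconds
     * @return proposedNext limited to [fetchTime + minInterval, fetchTime + maxInterval]
     */
    static long clampNextFetchTime(long fetchTime, long proposedNext,
                                   long minInterval, long maxInterval) {
        long earliest = fetchTime + minInterval * 1000L;
        long latest = fetchTime + maxInterval * 1000L;
        return Math.max(earliest, Math.min(latest, proposedNext));
    }

    public static void main(String[] args) {
        long fetchTime = 1_000_000_000_000L;              // arbitrary value for illustration
        long proposed = fetchTime - 2 * 86_400_000L;      // two days in the past, as in this report
        long clamped = clampNextFetchTime(fetchTime, proposed, 86400L, 604800L);
        System.out.println((clamped - fetchTime) / 1000L); // prints 86400
    }
}
```

      A clamp like this guarantees a next fetch time at least MIN_INTERVAL after the current fetch, but it does not settle which of the two sync_delta aims should take precedence.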

      This problem has been noted by pascaldimassimo in 1 and 2.

            People

              Assignee: Unassigned
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4