NUTCH-1588

Port NUTCH-1245 "URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again" to 2.x


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels: None

    Description

      A document that is gone (404) after db.fetch.interval.max (90 days) has
      passed is fetched over and over again: although its fetch status is
      fetch_gone, its status in CrawlDb stays db_unfetched. Consequently, this
      document will be generated and fetched in every cycle from now on.

      To reproduce:

      1. create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
      2. this URL is now fetched again
      3. when the CrawlDb is updated with the fetch_gone result, the CrawlDatum is reset to db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max (81 days)
      4. this does not change in subsequent generate-fetch-update cycles; here is the output for two segments:
        /tmp/testcrawl/segments/20120105161430
        SegmentReader: get 'http://localhost/page_gone'
        Crawl Generate::
        Status: 1 (db_unfetched)
        Fetch time: Thu Jan 05 16:14:21 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        Crawl Fetch::
        Status: 37 (fetch_gone)
        Fetch time: Thu Jan 05 16:14:48 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        
        /tmp/testcrawl/segments/20120105161631
        SegmentReader: get 'http://localhost/page_gone'
        Crawl Generate::
        Status: 1 (db_unfetched)
        Fetch time: Thu Jan 05 16:16:23 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        Crawl Fetch::
        Status: 37 (fetch_gone)
        Fetch time: Thu Jan 05 16:20:05 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        

      As far as I can see, it's caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:

      setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
          datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
          datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
          if (maxInterval < datum.fetchInterval) // necessarily true
             forceRefetch()
      
      forceRefetch:
          if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
             datum.fetchInterval = 0.9 * maxInterval
          datum.status = db_unfetched // page will be generated and fetched again
      
      
      shouldFetch (called from generate / Generator.map):
          if ((datum.fetchTime - curTime) > maxInterval)
             // always true if the crawler is launched in short intervals
             // (lower than 0.35 * maxInterval)
             datum.fetchTime = curTime // forces a refetch
      
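      The interaction is easier to see end to end. Below is a minimal, self-contained Java sketch that mirrors only the pseudo-code above; the class, field, and method names are illustrative and are not the Nutch API. Running it shows the page being generated in every cycle while its stored retry interval stays at 81 days:

      // Standalone mimic of the pseudo-code above (illustrative names only,
      // not the Nutch API). All times are in seconds.
      public class GoneScheduleSimulation {

          static final long DAY = 24 * 60 * 60;
          static final long MAX_INTERVAL = 90 * DAY;                 // db.fetch.interval.max

          static long fetchInterval = (long) (0.9f * MAX_INTERVAL);  // 6998400 s = 81 days, as in the dumps
          static long fetchTime;                                     // absolute time of the next scheduled fetch
          static String status = "db_unfetched";

          // setPageGoneSchedule, as called from update (CrawlDbReducer.reduce)
          static void setPageGoneSchedule(long curTime) {
              fetchInterval = (long) (fetchInterval * 1.5f);         // now 1.35 * maxInterval
              fetchTime = curTime + fetchInterval;                   // pushed ~121 days ahead (see NUTCH-516)
              if (MAX_INTERVAL < fetchInterval)                      // necessarily true
                  forceRefetch();
          }

          static void forceRefetch() {
              if (fetchInterval > MAX_INTERVAL)                      // true, it's 1.35 * maxInterval
                  fetchInterval = (long) (0.9f * MAX_INTERVAL);      // back to 81 days
              status = "db_unfetched";                               // fetchTime is NOT touched here
          }

          // shouldFetch, as called from generate (Generator.map)
          static boolean shouldFetch(long curTime) {
              if (fetchTime - curTime > MAX_INTERVAL)                // true: fetchTime is ~1.35 * maxInterval away
                  fetchTime = curTime;                               // forces a refetch
              return fetchTime <= curTime;
          }

          public static void main(String[] args) {
              long now = 0;
              for (int cycle = 1; cycle <= 3; cycle++) {
                  boolean generated = shouldFetch(now);              // generate
                  if (generated)                                     // fetch returns 404 -> fetch_gone,
                      setPageGoneSchedule(now);                      // then update reschedules the page
                  System.out.printf("cycle %d: generated=%s status=%s interval=%d days, next fetch in %d days%n",
                          cycle, generated, status, fetchInterval / DAY, (fetchTime - now) / DAY);
                  now += 2 * DAY;                                    // crawler relaunched every two days
              }
          }
      }

      Every cycle prints generated=true with status db_unfetched, an interval of 81 days, and a next fetch roughly 121 days away, matching the segment and CrawlDb dumps above.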

      After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days).
      Although the fetch time in the CrawlDb is far in the future,

      % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
      URL: http://localhost/page_gone
      Version: 7
      Status: 1 (db_unfetched)
      Fetch time: Sun May 06 05:20:05 CEST 2012
      Modified time: Thu Jan 01 01:00:00 CET 1970
      Retries since fetch: 0
      Retry interval: 6998400 seconds (81 days)
      Score: 1.0
      Signature: null
      Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
      

      the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max.
      The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 * db.fetch.interval.max, and the fetch time always ends up close to current time + 1.35 * db.fetch.interval.max.

      This is possibly a side effect of NUTCH-516 and may be related to NUTCH-578.
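
      For reference, the loop can be broken by capping the interval inside setPageGoneSchedule itself instead of falling through to forceRefetch(), so that a gone page keeps a fetch time within db.fetch.interval.max and is not flipped back to db_unfetched. The sketch below only illustrates that idea in the same terms as the pseudo-code above; it is not the attached patch and does not use the real Nutch classes.

      // Sketch of a capped gone-schedule (illustrative names, not the Nutch API
      // and not necessarily what the attached patch does).
      public class CappedGoneSchedule {

          static final long DAY = 24 * 60 * 60;
          static final long MAX_INTERVAL = 90 * DAY;                 // db.fetch.interval.max

          long fetchInterval = (long) (0.9f * MAX_INTERVAL);
          long fetchTime;

          void setPageGoneSchedule(long curTime) {
              if (fetchInterval * 1.5f < MAX_INTERVAL)
                  fetchInterval = (long) (fetchInterval * 1.5f);     // back off by 50% ...
              else
                  fetchInterval = (long) (0.9f * MAX_INTERVAL);      // ... but never beyond maxInterval
              fetchTime = curTime + fetchInterval;                   // never more than maxInterval in the future
              // no forceRefetch(): the status is not reset to db_unfetched
          }
      }

      With the interval capped, the stored fetch time never exceeds current time + db.fetch.interval.max, so the (fetchTime - curTime) > maxInterval branch in shouldFetch() is never taken and the page is simply retried after the long interval instead of in every cycle.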

      Attachments

        1. NUTCH-1588.patch (2 kB, Talat Uyarer)
        2. NUTCH-1588-final.patch (2 kB, Talat Uyarer)

            People

              Assignee: Unassigned
              Reporter: Lewis John McGibbney (lewismc)
              Votes: 0
              Watchers: 4
