Nutch / NUTCH-1245

URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again


Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.4, 1.5
    • 1.7
    • None
    • None
    • Patch Available

    Description

      A document that returns 404 after db.fetch.interval.max (90 days) has passed
      is fetched over and over again. Although its fetch status is fetch_gone,
      its status in CrawlDb remains db_unfetched. Consequently, the document is
      generated and fetched in every cycle from then on.

      To reproduce:

      1. Create a CrawlDatum in CrawlDb whose retry interval reaches db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this).
      2. The URL is now fetched again.
      3. When CrawlDb is updated with the fetch_gone result, the CrawlDatum is reset to db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max (81 days).
      4. This state does not change across generate-fetch-update cycles, as shown here for two segments:
        /tmp/testcrawl/segments/20120105161430
        SegmentReader: get 'http://localhost/page_gone'
        Crawl Generate::
        Status: 1 (db_unfetched)
        Fetch time: Thu Jan 05 16:14:21 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        Crawl Fetch::
        Status: 37 (fetch_gone)
        Fetch time: Thu Jan 05 16:14:48 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        
        /tmp/testcrawl/segments/20120105161631
        SegmentReader: get 'http://localhost/page_gone'
        Crawl Generate::
        Status: 1 (db_unfetched)
        Fetch time: Thu Jan 05 16:16:23 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        
        Crawl Fetch::
        Status: 37 (fetch_gone)
        Fetch time: Thu Jan 05 16:20:05 CET 2012
        Modified time: Thu Jan 01 01:00:00 CET 1970
        Retries since fetch: 0
        Retry interval: 6998400 seconds (81 days)
        Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
        

      As far as I can see, this is caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:

      setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
          datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
          datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
          if (maxInterval < datum.fetchInterval) // necessarily true
             forceRefetch()
      
      forceRefetch:
          if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
             datum.fetchInterval = 0.9 * maxInterval
          datum.status = db_unfetched
      
      
      shouldFetch (called from generate / Generator.map):
          if ((datum.fetchTime - curTime) > maxInterval)
             // always true if the crawler is launched in short intervals
             // (lower than 0.35 * maxInterval)
             datum.fetchTime = curTime // forces a refetch
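
The pseudo-code above can be turned into a small stand-alone simulation. This is only a sketch under stated assumptions: `Datum` is a hypothetical stand-in for CrawlDatum that carries just the three fields used in the pseudo-code, times are plain seconds, the return condition of `shouldFetch` (fetch when the fetch time is not in the future) is assumed, and `simulate` is an illustrative helper, not part of Nutch.

```java
// Simulation of the pseudo-code from this report; not the Nutch source code.
class PageGoneCycle {
    static final long DAY = 24L * 60 * 60;      // one day in seconds
    static final long MAX_INTERVAL = 90 * DAY;  // db.fetch.interval.max (90 days)

    // Hypothetical stand-in for CrawlDatum (assumption: only these fields matter here).
    static class Datum {
        String status = "db_unfetched";
        double fetchInterval = 0.9 * MAX_INTERVAL; // retry interval in seconds
        long fetchTime = 0;                        // next fetch time in seconds
    }

    static void forceRefetch(Datum d) {
        if (d.fetchInterval > MAX_INTERVAL) {
            d.fetchInterval = 0.9 * MAX_INTERVAL;
        }
        d.status = "db_unfetched";
    }

    // Called from update (CrawlDbReducer.reduce) in the report's description.
    static void setPageGoneSchedule(Datum d, long fetchTime) {
        d.fetchInterval = 1.5 * d.fetchInterval;
        d.fetchTime = fetchTime + (long) d.fetchInterval;
        if (MAX_INTERVAL < d.fetchInterval) {
            forceRefetch(d);
        }
    }

    // Called from generate (Generator.map) in the report's description.
    static boolean shouldFetch(Datum d, long curTime) {
        if (d.fetchTime - curTime > MAX_INTERVAL) {
            d.fetchTime = curTime; // forces a refetch
        }
        return d.fetchTime <= curTime; // assumed return condition
    }

    // Runs generate-fetch-update cycles spaced stepSeconds apart and
    // returns how many times the URL was generated.
    static int simulate(Datum d, int cycles, long stepSeconds) {
        int generated = 0;
        for (int i = 0; i < cycles; i++) {
            long curTime = i * stepSeconds;
            if (shouldFetch(d, curTime)) {
                generated++;
                setPageGoneSchedule(d, curTime); // every fetch ends in fetch_gone
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        int generated = simulate(d, 5, DAY);
        System.out.println("generated " + generated + " times, status " + d.status
                + ", retry interval " + Math.round(d.fetchInterval) + " seconds");
    }
}
```

With daily cycles the URL is generated in every cycle and the datum ends each update as db_unfetched with a retry interval of 0.9 * db.fetch.interval.max, which matches the segment dumps above. With cycles spaced more than 0.35 * db.fetch.interval.max apart, the forced reset in `shouldFetch` no longer fires every time.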
      

      After setPageGoneSchedule is called via update, the status is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days).
      Although the fetch time in the CrawlDb is far in the future

      % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
      URL: http://localhost/page_gone
      Version: 7
      Status: 1 (db_unfetched)
      Fetch time: Sun May 06 05:20:05 CEST 2012
      Modified time: Thu Jan 01 01:00:00 CET 1970
      Retries since fetch: 0
      Retry interval: 6998400 seconds (81 days)
      Score: 1.0
      Signature: null
      Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
      

      the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max.
      The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max.
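
The figures in the dumps can be checked with a little date arithmetic. The sketch below assumes db.fetch.interval.max is 90 days, as stated in the description, and assumes the Europe/Berlin time zone for the CET/CEST timestamps. The class and method names are illustrative only.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class FetchTimeCheck {
    static final Duration MAX_INTERVAL = Duration.ofDays(90); // db.fetch.interval.max

    // Retry interval after forceRefetch: 0.9 * db.fetch.interval.max, in seconds.
    static long retryIntervalSeconds() {
        return MAX_INTERVAL.getSeconds() * 9 / 10;
    }

    // Fetch time written by setPageGoneSchedule: fetch time + 1.35 * db.fetch.interval.max.
    static ZonedDateTime nextFetchTime(ZonedDateTime fetched) {
        return fetched.plusSeconds(MAX_INTERVAL.getSeconds() * 135 / 100);
    }

    public static void main(String[] args) {
        // Fetch time of the second segment: Thu Jan 05 16:20:05 CET 2012
        // (time zone is an assumption).
        ZonedDateTime fetched =
                ZonedDateTime.of(2012, 1, 5, 16, 20, 5, 0, ZoneId.of("Europe/Berlin"));
        System.out.println(retryIntervalSeconds()); // 6998400 seconds (81 days)
        System.out.println(nextFetchTime(fetched)); // 2012-05-06 05:20:05 local time
    }
}
```

The computed fetch time agrees with the readdb output (Sun May 06 05:20:05 CEST 2012), which is 121.5 days after the last fetch and therefore more than db.fetch.interval.max ahead of any generate run started within the following 31.5 days.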

      It is possibly a side effect of NUTCH-516 and may be related to NUTCH-578.

      Attachments

        1. NUTCH-1245-1.patch
          0.9 kB
          Sebastian Nagel
        2. NUTCH-1245-2.patch
          1 kB
          Sebastian Nagel
        3. NUTCH-1245-578-TEST-1.patch
          16 kB
          Sebastian Nagel
        4. NUTCH-1245-578-TEST-2.patch
          17 kB
          Sebastian Nagel

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 6
