Nutch / NUTCH-1422

bypass signature comparison when a document is redirected


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.9
    • Component/s: crawldb, fetcher
    • Labels: None
    • Patch Info: Patch Available

    Description

      In a long-running continuous crawl with Nutch 1.4, URLs that start returning an HTTP redirect (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Brief timeline (cf. the attached dump of segment / CrawlDb data):
      2012-02-23 : injected
      2012-02-24 : fetched
      2012-03-30 : re-fetched, signature changed
      2012-04-20 : re-fetched, redirected
      2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content!

      The signature of a previously fetched document is not reset when the URL/doc later turns into a redirect. CrawlDbReducer.reduce then sets the status to db_notmodified because the signature that arrives with the new fetch status is identical to the old one.

      Possible fixes (??):

      • reset the signature in Fetcher
      • handle this case in CrawlDbReducer.reduce
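
      The second option can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: the class RedirectSignatureSketch, its enums, and the methods nextDbStatus and signatureToStore are hypothetical names, and no Nutch classes are used. The idea is to decide on the fetch status first, so the signature comparison is only reached for a successful fetch, and to drop the stale signature once a redirect is seen.

```java
import java.util.Arrays;

// Hypothetical sketch; these types only mimic the statuses discussed in this issue.
class RedirectSignatureSketch {

    enum FetchStatus { SUCCESS, NOTMODIFIED, REDIR_TEMP, REDIR_PERM }

    enum DbStatus { FETCHED, NOTMODIFIED, REDIR_TEMP, REDIR_PERM }

    // Decide the new CrawlDb status. Redirects return early, so an
    // unchanged (stale) signature can never turn them into NOTMODIFIED.
    static DbStatus nextDbStatus(FetchStatus fetch, byte[] oldSig, byte[] newSig) {
        switch (fetch) {
            case REDIR_TEMP:
                return DbStatus.REDIR_TEMP;
            case REDIR_PERM:
                return DbStatus.REDIR_PERM;
            case NOTMODIFIED:
                return DbStatus.NOTMODIFIED;
            default:
                // Only a successful fetch is subject to signature comparison.
                boolean unchanged = oldSig != null && newSig != null
                        && Arrays.equals(oldSig, newSig);
                return unchanged ? DbStatus.NOTMODIFIED : DbStatus.FETCHED;
        }
    }

    // Signature to keep in the db entry: reset it (null) for redirects so
    // the old content's signature is not carried forward.
    static byte[] signatureToStore(FetchStatus fetch, byte[] newSig) {
        boolean redirect = fetch == FetchStatus.REDIR_TEMP || fetch == FetchStatus.REDIR_PERM;
        return redirect ? null : newSig;
    }
}
```

      With this ordering, the 2012-04-20 re-fetch in the timeline above would end up with a redirect status even though the signature in the fetch datum still equals the one stored in the CrawlDb.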

      Attachments

        1. NUTCH-1422-trunk-v2.patch
          7 kB
          Julien Nioche
        2. NUTCH-1422-trunk-v1.patch
          0.9 kB
          Sebastian Nagel
        3. NUTCH-1422_redir_notmodified_log.txt
          3 kB
          Sebastian Nagel

            People

              Assignee: Unassigned
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 5
