Nutch / NUTCH-1646

IndexerMapReduce to consider DB status


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7
    • Fix Version/s: 1.8
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

    Description

      IndexerMapReduce removes gone pages and redirects based on the fetch status only; it does not consider the DB status. This means segments merged before we fixed SegmentMerger may contain records that do not have a correct status. For example, a page can be gone on the web, gone in the CrawlDB, and gone in the segments, yet merging those old segments can cause an older status to prevail, so the page is indexed although the CrawlDB says it is gone.
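
      The behaviour asked for here can be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch: the enums `FetchStatus` and `DbStatus` and the class `IndexDecision` are hypothetical stand-ins for Nutch's CrawlDatum status codes and for the check inside the indexer's reduce step. The point is only that a record is removed when either the fetch status or the DB status marks it as gone or redirected.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins for the fetch and DB statuses carried by a CrawlDatum.
// These names are assumptions for illustration, not the real Nutch constants.
enum FetchStatus { SUCCESS, GONE, REDIR_PERM, REDIR_TEMP, NOTMODIFIED }

enum DbStatus { FETCHED, GONE, REDIR_PERM, REDIR_TEMP, NOTMODIFIED }

final class IndexDecision {
    enum Action { INDEX, DELETE }

    // Fetch statuses that mean the page should not be in the index.
    private static final Set<FetchStatus> FETCH_REMOVED =
        EnumSet.of(FetchStatus.GONE, FetchStatus.REDIR_PERM, FetchStatus.REDIR_TEMP);

    // DB statuses that mean the page should not be in the index.
    private static final Set<DbStatus> DB_REMOVED =
        EnumSet.of(DbStatus.GONE, DbStatus.REDIR_PERM, DbStatus.REDIR_TEMP);

    private IndexDecision() { }

    // Delete when EITHER status says gone/redirect, so a stale fetch status
    // from an old merged segment cannot override what the CrawlDB knows.
    // A null status means "not available" and never triggers a delete by itself.
    static Action decide(FetchStatus fetchStatus, DbStatus dbStatus) {
        boolean removedByFetch = fetchStatus != null && FETCH_REMOVED.contains(fetchStatus);
        boolean removedByDb = dbStatus != null && DB_REMOVED.contains(dbStatus);
        return (removedByFetch || removedByDb) ? Action.DELETE : Action.INDEX;
    }
}
```

      With a fetch-status-only check, a record with `FetchStatus.SUCCESS` from an old segment would be indexed even when the CrawlDB reports it as gone; in this sketch `IndexDecision.decide(FetchStatus.SUCCESS, DbStatus.GONE)` yields `DELETE` instead.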

      Attachments

        1. NUTCH-1646.patch
          3 kB
          Markus Jelsma
        2. NUTCH-1646-2.patch
          3 kB
          Sebastian Nagel
        3. NUTCH-1646-3.patch
          4 kB
          Sebastian Nagel
        4. NUTCH-1646-trunk.patch
          2 kB
          Markus Jelsma


            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Markus Jelsma (markus17)
              Votes: 0
              Watchers: 5
