Description
IndexerMapReduce removes gone pages and redirects based only on the fetch status, not on the DB status. This means segments merged before we fixed SegmentMerger may contain records that do not have a correct status. For example, a page may be gone on the web, gone in the CrawlDB, and gone in the segments, yet merging those old segments can cause an older status to prevail, so the page is indexed although the CrawlDB says it is gone.
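The intended behaviour can be illustrated with a minimal, self-contained sketch. It is not the actual IndexerMapReduce code: the `Status` and `Action` enums and the `IndexDecisionSketch.decide` method are hypothetical names, standing in for the status values carried by CrawlDB and segment records. The sketch only shows the rule described above, assuming a record should be removed when either the DB status or the fetch status marks it as gone or redirected.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the status values of CrawlDB and segment records.
enum Status {
    DB_FETCHED, DB_GONE, DB_REDIR_TEMP, DB_REDIR_PERM,
    FETCH_SUCCESS, FETCH_GONE, FETCH_REDIR_TEMP, FETCH_REDIR_PERM
}

// Hypothetical outcome of the indexing decision for one URL.
enum Action { INDEX, DELETE }

final class IndexDecisionSketch {
    // Statuses that mean the page is gone or has been redirected.
    private static final Set<Status> REMOVABLE = EnumSet.of(
            Status.DB_GONE, Status.DB_REDIR_TEMP, Status.DB_REDIR_PERM,
            Status.FETCH_GONE, Status.FETCH_REDIR_TEMP, Status.FETCH_REDIR_PERM);

    private IndexDecisionSketch() { }

    // Checks both statuses, so a stale fetch status kept alive by an old
    // segment merge cannot override a CrawlDB entry that says "gone".
    // Either argument may be null when the corresponding record is missing.
    static Action decide(Status dbStatus, Status fetchStatus) {
        boolean dbRemovable = dbStatus != null && REMOVABLE.contains(dbStatus);
        boolean fetchRemovable = fetchStatus != null && REMOVABLE.contains(fetchStatus);
        return (dbRemovable || fetchRemovable) ? Action.DELETE : Action.INDEX;
    }
}
```

With this rule, `decide(Status.DB_GONE, Status.FETCH_SUCCESS)` yields `DELETE`, which is the case from the description: the old merged segment still reports a successful fetch, but the CrawlDB says the page is gone.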
Attachments
Issue Links
- is related to NUTCH-1707 DummyIndexingWriter (Closed)