Nutch / NUTCH-2540

Support Generic Deduplication in Nutch 2.x

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.5
    • Component/s: indexer

    Description

      Currently, deduplication in 2.x exists only as a utility for the Solr index.

      My use case for Nutch required deduplication, so I wrote custom code that checks for duplicates based on digest and deletes them at index time. I figured I'd port the change so that others could use it as well.

      This is a very simple approach to deduplication, and there's plenty of room to improve it.

      This change adds a new DataStore of Duplicate entries. Each entry is simply a list of URLs, keyed by the signature those URLs share.

      A DeduplicatorJob can be run between the DbUpdaterJob and the IndexingJob to map WebPages into the Duplicate DataStore.
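
To make the shape of the Duplicate store concrete, here is a minimal in-memory sketch of the grouping step. It is not the patch's code: `DuplicateIndexSketch`, `add`, and `urlsFor` are hypothetical names, and a plain map stands in for the DataStore.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Duplicate DataStore: digest -> URLs.
class DuplicateIndexSketch {
    private final Map<String, List<String>> urlsByDigest = new LinkedHashMap<>();

    // Record that the page at `url` has the given content digest.
    void add(String digest, String url) {
        List<String> urls = urlsByDigest.computeIfAbsent(digest, k -> new ArrayList<>());
        if (!urls.contains(url)) {   // re-running the job must not add a URL twice
            urls.add(url);
        }
    }

    // All URLs that share the digest, in insertion order (empty if the digest is unknown).
    List<String> urlsFor(String digest) {
        return urlsByDigest.getOrDefault(digest, Collections.emptyList());
    }
}
```

Any digest whose list holds more than one URL marks a group of duplicates.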

      Because the key of the Duplicate store is the digest field of the WebPage store entries, duplicate matching can be configured by extending the abstract Signature class.
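
As an illustration of why the choice of signature controls what counts as a duplicate, the sketch below compares an exact hash with a looser one. `SignatureSketch` and its methods are hypothetical and do not use Nutch's Signature API; MD5 is only an assumption for the hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical signature functions; Nutch's own Signature class has a different API.
class SignatureSketch {
    // Exact signature: hash the text as-is, so any difference yields a new key.
    static String exact(String text) {
        return md5Hex(text);
    }

    // Looser signature: lower-case and collapse whitespace before hashing,
    // so pages that differ only in case or spacing share one key.
    static String normalized(String text) {
        String canonical = text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        return md5Hex(canonical);
    }

    private static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With these two functions, "Hello   World" and "hello world" get different keys under `exact` but the same key under `normalized`, so only the second would group them as duplicates.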

      A new "-deduplicate" argument is added to the IndexingJob (false by default). When this flag is set, the IndexingJob checks the Duplicate DataStore for URLs that duplicate the current WebPage and runs the pluggable DuplicateFilters to determine which URL belongs to the original WebPage. If the current WebPage is not the original, it is skipped. If it is the original, the other pages are deleted from the index.
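
The decision described above can be sketched roughly as follows. This models the logic only and is not the IndexingJob code: `IndexDecisionSketch`, `Decision`, and the `pickOriginal` function (a stand-in for the DuplicateFilters) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of the index-time decision; not the IndexingJob code.
class IndexDecisionSketch {
    // Outcome for one page: whether to index it, and which URLs to delete from the index.
    static final class Decision {
        final boolean indexPage;
        final List<String> urlsToDelete;

        Decision(boolean indexPage, List<String> urlsToDelete) {
            this.indexPage = indexPage;
            this.urlsToDelete = urlsToDelete;
        }
    }

    // `duplicates` holds every URL sharing the page's digest, including `url` itself.
    // `pickOriginal` stands in for the pluggable DuplicateFilters.
    static Decision decide(String url, List<String> duplicates,
                           Function<List<String>, String> pickOriginal) {
        if (duplicates.size() <= 1) {
            return new Decision(true, new ArrayList<>());    // no duplicates: index as usual
        }
        String original = pickOriginal.apply(duplicates);
        if (!url.equals(original)) {
            return new Decision(false, new ArrayList<>());   // not the original: skip the page
        }
        List<String> others = new ArrayList<>(duplicates);
        others.remove(url);                                  // keep the original itself
        return new Decision(true, others);                   // index it, delete the rest
    }
}
```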

      I've also added a BasicDuplicateFilter plugin class that considers the URL with the shortest path to be the original.
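
A shortest-path rule along these lines might look like the sketch below. The plugin's real code may differ: `ShortestPathSketch` is a hypothetical name, and measuring the path by character count (rather than by number of segments) is an assumption.

```java
import java.net.URI;
import java.util.List;

// Hypothetical re-creation of a "shortest path wins" rule; the plugin's real code may differ.
class ShortestPathSketch {
    // Length in characters of the URL's path component.
    // URI.create throws IllegalArgumentException for malformed URLs.
    static int pathLength(String url) {
        String path = URI.create(url).getPath();
        return path == null ? 0 : path.length();
    }

    // Returns the URL with the shortest path; the first one wins a tie.
    // Returns null for an empty list.
    static String pickOriginal(List<String> urls) {
        String best = null;
        for (String url : urls) {
            if (best == null || pathLength(url) < pathLength(best)) {
                best = url;
            }
        }
        return best;
    }
}
```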

      Eventually, it would be best to consider things like score and fetch time when determining which WebPage to keep and which to remove.
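
One possible shape for such a rule is sketched below. It is purely illustrative: `RankedDuplicateSketch` and `Candidate` are hypothetical, and preferring the higher score first and the earlier fetch time second is an assumption, not something this issue specifies.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranking of duplicate candidates by score, then fetch time.
class RankedDuplicateSketch {
    static final class Candidate {
        final String url;
        final float score;
        final long fetchTime;   // epoch milliseconds

        Candidate(String url, float score, long fetchTime) {
            this.url = url;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    // Assumed preference: higher score, then earlier fetch, then URL as a stable tie-break.
    static final Comparator<Candidate> PREFERENCE = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        if (byScore != 0) {
            return byScore;
        }
        int byFetch = Long.compare(a.fetchTime, b.fetchTime);
        if (byFetch != 0) {
            return byFetch;
        }
        return a.url.compareTo(b.url);
    };

    // The most preferred candidate; the list must not be empty.
    static Candidate pickOriginal(List<Candidate> candidates) {
        return Collections.min(candidates, PREFERENCE);
    }
}
```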

          People

            Assignee: Unassigned
            Reporter: Ben Vachon (bvachon)
            Votes: 0
            Watchers: 2

              Time Tracking

                Original Estimate: 120h
                Remaining Estimate: 120h
                Time Spent: Not Specified