Nutch / NUTCH-1830

Solr Delete Duplicates: Adding option to exclude IDs matching specified patterns

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: indexer
    • Patch Info: Patch Available

    Description

      The SolrDeleteDuplicates class and its associated functionality have been helpful for getting rid of duplicate pages that arise from URL variations. However, in some cases the textual content of two pages is close enough that the duplicate detector counts them as duplicates, yet both need to be kept as distinct, searchable pages.

      For example, for some of our products, or resellers of our products, the webpage template is the same and only a small amount of text differs between products/resellers. Some of these pages are therefore counted as duplicates, but we want all of them to be included and searchable on our site so that people can find things by name, even if the name is not in a key field.

      We can manually specify which group of URLs these pages correspond to (via regular expressions) to prevent them from being deleted as duplicates.

      The result is a mechanism for manually excluding documents from deduplication by ID.

      This patch adds a configuration option to nutch-site.xml: a new property, "solr.exclude.from.dedup.regex.file", that lets users specify a file containing a list of regular expressions:

      <property>
         <name>solr.exclude.from.dedup.regex.file</name>
         <value>regex-exclude-urls-from-dedup.txt</value>
         <description>
            The name of the file containing regular expressions that specify URLs (ids) to be excluded from the Solr deduplication process.
            I.e., any URL matching one of the regular expressions will not be subject to potential deduplication.
            Each pattern must be on its own line and start with a "-" character; all other lines are ignored.
            Also, a pattern must match the entire URL.
         </description>
      </property>
      

      The property specifies the name of a file containing a list of regular expressions, one per line, each marked by a leading "-". If an ID matches one of these expressions during the deduplication process, the document with that ID is skipped, i.e., it is not subject to deduplication.
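
      The file handling described above could be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the class name DedupExclusionSketch and the methods parsePatterns and isExcluded are hypothetical, and the sketch assumes that pattern lines carry no leading or trailing whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the class added by the patch.
class DedupExclusionSketch {

    // Keep only lines that start with "-"; comments, blank lines and
    // anything else are ignored. The "-" marker itself is stripped.
    static List<Pattern> parsePatterns(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("-") && line.length() > 1) {
                patterns.add(Pattern.compile(line.substring(1)));
            }
        }
        return patterns;
    }

    // An ID is excluded from deduplication when at least one pattern
    // matches the entire ID (full match, not a substring search).
    static boolean isExcluded(String id, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "#Exclude all reseller pages from deduplication:",
            "-.*/company/reseller/.*");
        List<Pattern> patterns = parsePatterns(lines);
        // The example.com URLs below are made up for the demonstration.
        System.out.println(isExcluded(
            "http://www.example.com/company/reseller/acme.html", patterns)); // true
        System.out.println(isExcluded(
            "http://www.example.com/products/widget.html", patterns));       // false
    }
}
```

      In a deduplication pass, a check like isExcluded would presumably run on each document ID before it is considered as a duplicate candidate, so that excluded documents are never deleted.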

      Here is an example file:

      #Allows specifying regular expressions for which any matching URLs
      #will not be subjected to potential deduplication
      #Requires regex strings to match full URL
      #Each regex string must start with "-"; all other lines are ignored.
      
      #Exclude all reseller pages from deduplication:
      -.*/company/reseller/.*
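
      Because a pattern has to match the full URL, the leading and trailing ".*" in the example above matter. The following sketch shows the difference, assuming full-match semantics as provided by java.util.regex.Matcher.matches(); the class name FullMatchSketch, the BARE pattern and the example.com URL are made up for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration class; not part of the patch.
class FullMatchSketch {

    // The example pattern from the file, with the leading "-" marker removed.
    static final Pattern RESELLER = Pattern.compile(".*/company/reseller/.*");

    // Hypothetical variant without the surrounding ".*".
    static final Pattern BARE = Pattern.compile("/company/reseller/");

    // True only if the pattern matches the whole ID.
    static boolean fullMatch(Pattern p, String id) {
        return p.matcher(id).matches();
    }

    public static void main(String[] args) {
        String id = "http://www.example.com/company/reseller/acme.html";
        System.out.println(fullMatch(RESELLER, id));  // true: pattern covers the whole URL
        System.out.println(fullMatch(BARE, id));      // false: pattern covers only a substring
        System.out.println(BARE.matcher(id).find());  // true: a substring search would succeed
    }
}
```

      Under this assumption, a pattern meant to cover a whole group of URLs needs wildcards on both sides of the distinguishing path segment.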
      

          People

            Assignee: Unassigned
            Reporter: Brian (brian44)
            Votes: 0
            Watchers: 1