[NUTCH-1830] Solr Delete Duplicates: Adding option to exclude IDs matching specified patterns - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: indexer
Labels:
- Solr
- dedupe
- nutch

Patch Info: Patch Available

Description

The SolrDeleteDuplicates class and its associated functionality have been helpful for getting rid of duplicate pages that arise from variations of URLs. However, in some cases pages whose textual content is close enough for the duplicate detector to count them as duplicates still need to be kept as distinct, searchable pages.

For example, for some products or resellers of our products, the webpage template is the same and only a small amount of text differs between products or resellers. Some of these pages are therefore counted as duplicates, but we want all of them to be included and searchable on our site so that people can find things by name, even if the name is not in a key field.

We can manually specify which group of URLs these pages correspond to (via some regexes) to prevent them from being potentially deleted as duplicates.

This provides a mechanism for manually excluding documents from deduplication by ID.

This patch adds a new property to nutch-site.xml, "solr.exclude.from.dedup.regex.file", which lets users specify a file containing a list of regular expressions:

<property>
   <name>solr.exclude.from.dedup.regex.file</name>
   <value>regex-exclude-urls-from-dedup.txt</value>
   <description>
      Holds the file name of the file containing any regular expressions specifying URLs (ids) to be excluded from the Solr Deduplication process.
      I.e., any URL matching one of the regular expressions will not be subject to potential deduplication.
      Each pattern string must start on its own line with a "-" character at the beginning - all other lines will be ignored.
      Also, the URLs must match the entire pattern.
   </description>
</property>

The property specifies the name of a file containing a list of regular expressions, each on its own line starting with "-".
- If any ID matches one of these expressions during the deduplication process, the document with that ID is skipped, i.e., it is not subject to deduplication.

Here is an example file:

#Allows specifying regular expressions for which any matching URLs
#will not be subjected to potential deduplication
#Requires regex strings to match full URL
#Each regex string must start with "-"; all other lines are ignored.

#Exclude all reseller pages from deduplication:
-.*/company/reseller/.*
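
The following is a minimal sketch of how a file like the one above could be interpreted; it is not the code from the attached patch. The class name DedupExclusionSketch and the methods parsePatterns and isExcluded are hypothetical names chosen for illustration. It assumes that only lines beginning with "-" carry a pattern, that surrounding whitespace after the "-" is trimmed (an assumption, the description does not say), and that a pattern must match the entire ID, which is what Matcher.matches() checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; names are not taken from the patch.
public class DedupExclusionSketch {

    // Compiles every line that starts with "-" into a pattern.
    // All other lines (comments, blanks) are ignored.
    static List<Pattern> parsePatterns(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("-")) {
                // Assumption: whitespace around the regex is not significant.
                String regex = line.substring(1).trim();
                if (!regex.isEmpty()) {
                    patterns.add(Pattern.compile(regex));
                }
            }
        }
        return patterns;
    }

    // Returns true if the whole ID matches at least one exclusion pattern,
    // meaning the document would be skipped by deduplication.
    static boolean isExcluded(String id, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "#Exclude all reseller pages from deduplication:",
            "-.*/company/reseller/.*");
        List<Pattern> patterns = parsePatterns(lines);

        // Example URLs are made up for the demonstration.
        System.out.println(isExcluded(
            "http://www.example.com/company/reseller/acme.html", patterns)); // true
        System.out.println(isExcluded(
            "http://www.example.com/company/about.html", patterns));         // false
    }
}
```

Because the match is against the full ID, a pattern such as /company/reseller/ without the leading and trailing .* would not exclude any complete URL, which is why the example file wraps the path in .* on both sides.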

Attachments

- solr_delete_duplicates_add_exclusions.patch (27/Aug/14 19:28, 6 kB, Brian)
- solr_delete_duplicates_add_exclusions_2.patch (03/Sep/14 19:27, 6 kB, Brian)
- regex-exclude-urls-from-dedup.txt (27/Aug/14 19:44, 0.3 kB, Brian)

People

Assignee: Unassigned

Reporter: Brian

Votes: 0

Watchers: 1

Dates

Created: 27/Aug/14 19:14

Updated: 03/Sep/14 19:27