The SolrDeleteDuplicates class and its associated job have been useful for removing duplicate pages that arise from variations of URLs. However, in some cases two pages are so close in textual content that the duplicate detector counts them as duplicates, yet both need to be kept as distinct searchable pages.
For example, for some of our products and resellers, the webpage template is identical and only a small amount of text differs between pages. Some of these pages are therefore counted as duplicates, but we want all of them indexed and searchable on our site, so people can find each product or reseller by name, even when the name does not appear in a key field.
To prevent such pages from being deleted as duplicates, we can manually specify the group of URLs they correspond to via regular expressions. This provides a mechanism for manually excluding documents from deduplication by ID.
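A minimal sketch of how such an ID-based exclusion check might work is shown below. The class name, the file format (one regex per line, prefixed with "-"), and the full-match semantics are assumptions for illustration, not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: collects exclusion regexes and decides whether
// a given document ID should be skipped by the dedup job.
public class ExcludeFilter {
    private final List<Pattern> patterns = new ArrayList<>();

    // Each line of the exclusion file that starts with "-" contributes
    // one regex; the leading "-" is stripped before compiling.
    public void addLine(String line) {
        if (line.startsWith("-")) {
            patterns.add(Pattern.compile(line.substring(1)));
        }
    }

    // A document is excluded from deduplication when its ID matches
    // any configured pattern (full match assumed here).
    public boolean isExcluded(String id) {
        for (Pattern p : patterns) {
            if (p.matcher(id).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

The dedup reducer would consult `isExcluded(id)` before considering a document for deletion and simply emit it unchanged when it matches.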
This patch adds a new property, "solr.exclude.from.dedup.regex.file", to the nutch-site.xml configuration, allowing users to specify a file containing a list of regular expressions.
The property specifies the name of a file containing a list of regular expressions, one per line, each indicated by a leading "-":
-If any ID matches one of these expressions during the deduplication process, the document with that ID will be skipped
--I.e., it will not be subject to deduplication
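The configuration change might look like the following nutch-site.xml fragment; the file path and description text are illustrative, not taken from the patch:

```xml
<!-- Hypothetical nutch-site.xml fragment; the value is an example path. -->
<property>
  <name>solr.exclude.from.dedup.regex.file</name>
  <value>dedup-exclude-regex.txt</value>
  <description>File containing regular expressions (one per line, each
  prefixed with "-") matching document IDs that should be excluded from
  Solr deduplication.</description>
</property>
```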
Here is an example file:
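In the described format (one regex per line, each prefixed with "-"), it might look like this; the URL patterns below are illustrative, not taken from the patch:

```
-https?://www\.example\.com/products/.*
-https?://www\.example\.com/resellers/.*
```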