[NUTCH-656] DeleteDuplicates based on crawlDB only - ASF JIRA

Details

Type: Wish
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: indexer
Labels: None

Description

The existing dedup functionality relies on Lucene indices and can't be used when indexing is delegated to Solr.
I was wondering whether we could instead use the information in the crawlDB to detect which URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the crawlDB contains all the elements we need for dedup, namely:

URL
signature
fetch time
score

In map-reduce terms we would have two different jobs:

read the crawlDB and compare on URLs: keep only the most recent element; older ones are stored in a file and will be deleted later

read the crawlDB and have a map function generating signatures as keys and URL + fetch time + score as values
the reduce function would depend on which parameter is set (i.e. use signature or score) and would output a list of URLs to delete
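
The second job could be sketched in memory roughly as follows. This is only an illustration under stated assumptions, not Nutch code: `Entry` is a hypothetical stand-in for a crawlDB record, the `prefer` parameter is an assumed reading of "which parameter is set" (rank by score or by fetch time), and the grouping dictionary plays the role of the shuffle phase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a crawlDB record (not Nutch's CrawlDatum)."""
    url: str
    signature: str
    fetch_time: int  # assumed to be epoch seconds
    score: float


def map_by_signature(entries):
    """Map step: emit (signature, entry) pairs."""
    for entry in entries:
        yield entry.signature, entry


def reduce_duplicates(group, prefer="score"):
    """Reduce step: keep the best entry of a group, return the other URLs.

    `prefer` is an assumed option: "score" ranks by score first,
    anything else ranks by fetch time first.
    """
    if prefer == "score":
        rank = lambda e: (e.score, e.fetch_time)
    else:
        rank = lambda e: (e.fetch_time, e.score)
    keep = max(group, key=rank)
    return [e.url for e in group if e is not keep]


def urls_to_delete(entries, prefer="score"):
    """Group entries by signature and collect the URLs of the duplicates."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for signature, entry in map_by_signature(entries):
        groups[signature].append(entry)
    doomed = []
    for group in groups.values():
        doomed.extend(reduce_duplicates(group, prefer))
    return sorted(doomed)
```

With two entries sharing a signature, calling `urls_to_delete(entries, prefer="score")` returns the URL of the lower-scored one, and `prefer="fetch_time"` returns the URL of the older one. The resulting list is what would then be handed to whichever indexing backend is in use.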

This assumes that we can then use the URLs to identify documents in the indices.

Any thoughts on this? Am I missing something?

Julien

Attachments

NUTCH-656.patch, 25/Sep/13 10:29, 8 kB, Julien Nioche
NUTCH-656.v2.patch, 19/Oct/13 13:35, 13 kB, Julien Nioche
NUTCH-656.v3.patch, 14/Nov/13 10:29, 14 kB, Julien Nioche

Issue Links

relates to:

NUTCH-1324 DupeDB for Nutch (Open)
NUTCH-1047 Pluggable indexing backends (Closed)
NUTCH-1688 Port DeleteDuplicate based on crawlDB only to 2.x (Closed)

People

Assignee: Julien Nioche

Reporter: Julien Nioche

Votes: 0

Watchers: 5

Dates

Created: 09/Oct/08 08:35

Updated: 28/Jan/21 14:03

Resolved: 14/Nov/13 11:56