Issue Details (XML | Word | Printable)

Key: NUTCH-371
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Andrzej Bialecki
Reporter: Chris Schneider
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Nutch

DeleteDuplicates should remove documents with duplicate URLs

Created: 25/Sep/06 04:14 PM   Updated: 16/Oct/06 09:16 PM
Return to search
Component/s: indexer
Affects Version/s: None
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works DeleteDuplicates2.java 2006-10-05 08:39 AM Jim Kellerman 18 kB
Text File Licensed for inclusion in ASF works patch.txt 2006-10-05 11:20 PM Jim Kellerman 18 kB
Text File Licensed for inclusion in ASF works patch.txt 2006-10-05 12:10 AM Andrzej Bialecki 24 kB
Issue Links:
Reference
 

Resolution Date: 16/Oct/06 09:16 PM


 Description  « Hide
DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting documents with identical MD5 hashes), but this part is apparently not yet implemented. Here's the comment from DeleteDuplicates.java:

// 2. map indexes -> <<url, fetchdate>, <index,doc>>
// partition by url
// reduce, deleting all but most recent.
//
// Part 2 is not yet implemented, but the Indexer currently only indexes one
// URL per page, so this is not a critical problem.

It is apparently also known that re-fetching the same URL (e.g., one month later) will result in more than one document with the same URL (this is alluded to in NUTCH-95), but the comment above suggests that the indexer will solve the problem before DeleteDuplicates, because it will only index one document per URL.

This is not necessarily the case if the segments are to be divided among search servers, as each server will have its own index built from its own portion of the segments. Thus, if the URL in question was fetched in different segments, and these segments end up assigned to different search servers, then the indexer can't be relied on to eliminate the duplicates.

Thus, it seems like the second part of the DeleteDuplicates algorithm (i.e., deleting documents with duplicate URLs) needs to be implemented. I agree with Byron and Andrzej that the most recently fetched document (rather than the one with the highest score) should be preserved.

Finally, it's also possible to get duplicate URLs in the segments without re-fetching an expired URL in the crawldb. This can happen if 3 different URLs all redirect to the target URL. This is yet another consequence of handling redirections immediately, rather than adding the target URL to the crawldb for fetching in some subsequent segment (see NUTCH-273).



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Chris Schneider added a comment - 25/Sep/06 04:19 PM
The second part of the DeleteDuplicates algorithm must be implemented either as part of fixing NUTCH-95 or before doing so.

Chris Schneider added a comment - 25/Sep/06 04:21 PM
If redirections didn't cause the target URL to be fetched until the following segment, then all duplicate URLs in the segments would be the result of re-fetching expired pages.

Andrzej Bialecki added a comment - 03/Oct/06 07:48 PM
I think we need to change DeleteDuplicates to implement the following algorithm:

Step 1: delete URL duplicates, keeping the most recent document

Step 2: delete content duplicates, keeping the one with the highest score (or optionally the one with the shortest url?)

The order of these steps is important: first we need to ensure that we will keep the most recent versions of the pages - currently dedup removes by content hash first, which may delete newer documents and keep older ones ... oops. Indexer doesn't check this either - see NUTCH-378 for more details.

This requires storing fetchTime in the index, which automatically solves NUTCH-95.

The second step would keep the best scoring pages and discard all others. Or perhaps we should keep the shortest urls?

Finally, we really, really need a JUnit test for this - I already started writing one, stay tuned.


Andrzej Bialecki added a comment - 05/Oct/06 12:10 AM
This patch contains an alternative implementation of DeleteDuplicates, which follows the algorithm described in the previous comment. A new JUnit test was created to test this implementation.

Jim Kellerman added a comment - 05/Oct/06 02:35 AM
Let me copy my comments from Nutch-380 to here to explain why I linked it to this issue:

In Hadoop 0.6, JobConf.setInputKeyClass and JobConf.setInputValueClass are deprecated because the interface org.apache.hadoop.mapred.RecordReader has two new methods:

/**

  • Create an object of the appropriate type to be used as a key.
  • @return a new key object
    */
    WritableComparable createKey();

/**

  • Create an object of the appropriate type to be used as the value.
  • @return a new value object
    */
    Writable createValue();

This means that the key class and the value class need to be instantiable.

Making IndexDoc instantiable is not a big deal because it is always the same.
However, InputFormat is sometimes a Text and sometimes a MD5Hash in DeleteDuplicates2 (or in DeleteDuplicates, sometimes a UTF8 and sometimes a HashScore).

Since DeleteDuplicates(2).dedup knows what the key is for each phase, how about making two separate instantiable classes for the key classes and if sharing the code is that important, the classes can delegate to the static class?


Jim Kellerman added a comment - 05/Oct/06 08:39 AM
This version of DeleteDocuments2 compiles with both Hadoop 0.5 and Hadoop 0.6

Andrzej Bialecki added a comment - 05/Oct/06 09:46 AM
Thanks for investigating this. Regarding the updated version (please create diffs in the future): wouldn't it be easier to make createKey() an abstract method, which TestInputFormat and MD5HashInputFormat override, and then just use if (key instanceof MD5Hash) in RecordReader?

Jim Kellerman added a comment - 05/Oct/06 11:20 PM
> Andrzej Bialecki [05/Oct/06 02:46 AM] Thanks for investigating this. Regarding the
> updated version (please create diffs in the future): wouldn't it be easier to make
> createKey() an abstract method, which TextInputFormat and MD5HashInputFormat
> override, and then just use if (key instanceof MD5Hash) in RecordReader?

I have refactored it as you suggested so that there is as little duplicate code as possible. Patch attached.


Andrzej Bialecki added a comment - 16/Oct/06 09:16 PM
A modified version of this patch committed in rev. 464654 .