If redirections didn't cause the target URL to be fetched until the following segment, then all duplicate URLs in the segments would be the result of re-fetching expired pages.
I think we need to change DeleteDuplicates to implement the following algorithm:
Step 1: delete URL duplicates, keeping the most recent document
Step 2: delete content duplicates, keeping the one with the highest score (or optionally the one with the shortest url?)

The order of these steps is important: first we need to ensure that we will keep the most recent versions of the pages - currently dedup removes by content hash first, which may delete newer documents and keep older ones ... oops. Indexer doesn't check this either - see This requires storing fetchTime in the index, which automatically solves The second step would keep the best scoring pages and discard all others. Or perhaps we should keep the shortest urls?

Finally, we really, really need a JUnit test for this - I already started writing one, stay tuned.

This patch contains an alternative implementation of DeleteDuplicates, which follows the algorithm described in the previous comment. A new JUnit test was created to test this implementation.
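The two-step algorithm described above can be sketched in memory as follows. This is only an illustration, not the code from the attached patch: the Doc and DedupSketch classes and their field names are hypothetical, the real DeleteDuplicates runs as MapReduce jobs over index documents, and using the shortest url only as a tie-breaker between equal scores is an assumption made here to combine the two options mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an indexed document; not a Nutch class. */
final class Doc {
    final String url;
    final String contentHash; // e.g. an MD5 digest of the page content
    final long fetchTime;     // assumed to be stored in the index
    final float score;

    Doc(String url, String contentHash, long fetchTime, float score) {
        this.url = url;
        this.contentHash = contentHash;
        this.fetchTime = fetchTime;
        this.score = score;
    }
}

final class DedupSketch {
    /** Step 1: for each URL keep only the most recently fetched document. */
    static Collection<Doc> dedupByUrl(Collection<Doc> docs) {
        Map<String, Doc> newest = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc kept = newest.get(d.url);
            if (kept == null || d.fetchTime > kept.fetchTime) {
                newest.put(d.url, d);
            }
        }
        return newest.values();
    }

    /** Step 2: for each content hash keep the highest score (shortest url on ties). */
    static Collection<Doc> dedupByContent(Collection<Doc> docs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc kept = best.get(d.contentHash);
            boolean better = kept == null
                    || d.score > kept.score
                    || (d.score == kept.score && d.url.length() < kept.url.length());
            if (better) {
                best.put(d.contentHash, d);
            }
        }
        return best.values();
    }

    /** Runs the steps in the order argued for above: URL first, then content. */
    static List<Doc> dedup(Collection<Doc> docs) {
        return new ArrayList<>(dedupByContent(dedupByUrl(docs)));
    }
}
```

The order matters when an outdated copy of a page shares its content hash with a different URL: hashing first can let the outdated copy win the content comparison and remove the other URL, after which the outdated copy is itself removed as a URL duplicate. Running the URL step first avoids comparing against outdated copies at all.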
Let me copy my comments from NUTCH-380 to here to explain why I linked it to this issue:
In Hadoop 0.6, JobConf.setInputKeyClass and JobConf.setInputValueClass are deprecated because the interface org.apache.hadoop.mapred.RecordReader has two new methods: /**
/**
This means that the key class and the value class need to be instantiable. Making IndexDoc instantiable is not a big deal because it is always the same. Since DeleteDuplicates(2).dedup knows what the key is for each phase, how about making two separate instantiable key classes? If sharing the code is that important, the classes can delegate to the static class.

This version of DeleteDuplicates2 compiles with both Hadoop 0.5 and Hadoop 0.6.
Thanks for investigating this. Regarding the updated version (please create diffs in the future): wouldn't it be easier to make createKey() an abstract method, which TestInputFormat and MD5HashInputFormat override, and then just use if (key instanceof MD5Hash) in RecordReader?
> Andrzej Bialecki [05/Oct/06 02:46 AM] Thanks for investigating this. Regarding the
> updated version (please create diffs in the future): wouldn't it be easier to make
> createKey() an abstract method, which TextInputFormat and MD5HashInputFormat
> override, and then just use if (key instanceof MD5Hash) in RecordReader?

I have refactored it as you suggested so that there is as little duplicate code as possible. Patch attached.

A modified version of this patch committed in rev. 464654.
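The refactoring suggested in the quoted comment can be pictured roughly as below. This is a simplified sketch and not the committed patch: every type here (Key, UrlKey, HashKey, DedupInputFormat and its subclasses, readKey) is a hypothetical stand-in written without any Hadoop dependency, so only the shape of the idea carries over: an abstract createKey() that each phase overrides, plus one shared reader that branches on the key type.

```java
/** Hypothetical marker type standing in for a Hadoop key class. */
interface Key {}

/** Stand-in for the URL-based key used in the first dedup phase. */
final class UrlKey implements Key {
    String url = "";
}

/** Stand-in for an MD5Hash-style key used in the second dedup phase. */
final class HashKey implements Key {
    String hash = "";
}

abstract class DedupInputFormat {
    /** Each phase overrides this to supply a fresh, instantiable key. */
    abstract Key createKey();

    /**
     * Shared reader logic: the only phase-specific part is which field
     * of the document ends up in the key, decided by an instanceof check.
     */
    Key readKey(String url, String hash) {
        Key key = createKey();
        if (key instanceof HashKey) {
            ((HashKey) key).hash = hash;
        } else {
            ((UrlKey) key).url = url;
        }
        return key;
    }
}

final class UrlInputFormat extends DedupInputFormat {
    @Override
    Key createKey() {
        return new UrlKey();
    }
}

final class HashInputFormat extends DedupInputFormat {
    @Override
    Key createKey() {
        return new HashKey();
    }
}
```

With this layout the reading code exists once, and adding a phase only means adding a subclass that returns a different key type.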