[NUTCH-420] DeleteDuplicates.HashPartitioner depends on the order of IndexDocs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 0.9.0
Component/s: indexer
Labels: None

Description

DeleteDuplicates.HashPartitioner.reduce():

// byScore case
if (value.score > highest.score) {
  highest.keep = false;
  LOG.debug("-discard " + highest + ", keep " + value);
  output.collect(highest.url, highest); // delete highest
  highest = value;
}
// !byScore is also similar

So assume two docs with the same hash are in values. With the code as shown, if the second has a higher score than the first, the first doc is deleted. But if the second has a lower (or equal) score than the first, neither is deleted. So whether a duplicate gets removed depends on the order in which the docs arrive. AFAICS, there should be an else branch that deletes value and keeps highest as it is.
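
The proposed else branch could look roughly like the sketch below. This is not the attached patch; `IndexDoc` here is a hypothetical stand-in with only the fields mentioned in the report, and `selectDeletions` is an assumed helper name that returns the docs to delete instead of calling `output.collect`. Only the byScore case is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real IndexDoc; only the fields used in the report.
class IndexDoc {
    final String url;
    final float score;
    boolean keep = true;

    IndexDoc(String url, float score) {
        this.url = url;
        this.score = score;
    }
}

class DedupSketch {
    // Assumed helper: given all docs sharing one hash, return the ones to delete
    // (byScore case). Exactly one doc, the highest-scoring, survives.
    static List<IndexDoc> selectDeletions(List<IndexDoc> values) {
        List<IndexDoc> deleted = new ArrayList<>();
        IndexDoc highest = null;
        for (IndexDoc value : values) {
            if (highest == null) {
                highest = value; // first doc is the initial candidate
                continue;
            }
            if (value.score > highest.score) {
                highest.keep = false;
                deleted.add(highest); // delete the previous candidate
                highest = value;
            } else {
                value.keep = false;
                deleted.add(value); // the missing else: delete value, keep highest
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<IndexDoc> docs = new ArrayList<>();
        docs.add(new IndexDoc("http://example.com/a", 2.0f));
        docs.add(new IndexDoc("http://example.com/b", 1.0f));
        for (IndexDoc d : selectDeletions(docs)) {
            System.out.println("delete " + d.url);
        }
    }
}
```

With the else branch in place, the same doc survives regardless of the order of values, except that on an exact score tie the doc seen first is kept.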

Attachments

index.tar.gz     08/Jan/07 15:27   0.9 kB   Dogacan Guney
dedup-v3.patch   09/Jan/07 08:40   5 kB     Dogacan Guney
dedup-v2.patch   04/Jan/07 09:29   2 kB     Dogacan Guney
dedup.patch      26/Dec/06 11:31   2 kB     Dogacan Guney

People

Assignee: Andrzej Bialecki

Reporter: Dogacan Guney

Votes: 0

Watchers: 0

Dates

Created: 26/Dec/06 11:30

Updated: 11/Jan/07 22:02

Resolved: 11/Jan/07 22:02