[BLUR-422] Random duplicate detection during row overflow - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 0.2.4
Fix Version/s: 0.2.4
Component/s: Blur MapReduce
Labels: None

Description

Duplicate detection of Records during indexing works as long as the Row does not overflow. If the Row overflows, duplicate detection only works within the buffered Records. In addition, because the ordering of Records within a Row is indeterminate during indexing, the duplicate counts can differ between indexing jobs.
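
To illustrate the behavior described above, here is a minimal Java sketch. It is not Blur's actual indexing code: the class name OverflowDuplicateSketch, the method countDetectedDuplicates, and the "clear the buffer once it is full" overflow model are all assumptions made for illustration. It only shows how checking duplicates against a bounded buffer can make the count depend on Record order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of buffered duplicate detection; not taken from Blur.
final class OverflowDuplicateSketch {

    // Counts duplicates that are visible within the current buffer only.
    // Assumption: when the buffer is full, it is flushed (cleared) and
    // detection starts over with no memory of earlier Record ids.
    static int countDetectedDuplicates(List<String> recordIds, int bufferCapacity) {
        Set<String> buffer = new HashSet<>();
        int duplicates = 0;
        for (String id : recordIds) {
            if (buffer.contains(id)) {
                duplicates++;          // duplicate seen inside the buffer
                continue;
            }
            if (buffer.size() == bufferCapacity) {
                buffer.clear();        // Row overflow: earlier ids are forgotten
            }
            buffer.add(id);
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Same four Records, two different orderings, buffer of two Records.
        System.out.println(countDetectedDuplicates(Arrays.asList("a", "a", "b", "c"), 2)); // prints 1
        System.out.println(countDetectedDuplicates(Arrays.asList("a", "b", "c", "a"), 2)); // prints 0
    }
}
```

With a buffer large enough to hold the whole Row, both orderings report one duplicate, which matches the non-overflow case described above.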

A proposed solution is to allow the user to specify which action to take during indexing:

CHOSE_ONE (chooses one of the duplicate Records being indexed)
CHOSE_ONE_AND_WRITE_OVERFLOW (chooses one of the duplicate Records being indexed and writes the other Records out to a known location)
ERROR (causes the job to exit when a duplicate Record is found)
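
The three actions could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Blur API: the DuplicateActionSketch, SketchRecord, and Resolver names are hypothetical, "choose one" is modeled as "keep the first Record seen for a Record id", and the "known location" is modeled as an in-memory list rather than a real output path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed duplicate actions; not Blur code.
final class DuplicateActionSketch {

    // Action names as written in this issue.
    enum DuplicateAction { CHOSE_ONE, CHOSE_ONE_AND_WRITE_OVERFLOW, ERROR }

    // Minimal stand-in for a Record: an id plus some payload.
    static final class SketchRecord {
        final String recordId;
        final String value;
        SketchRecord(String recordId, String value) {
            this.recordId = recordId;
            this.value = value;
        }
    }

    static final class Resolver {
        private final DuplicateAction action;
        // Assumption: stands in for the "known location" for overflow Records.
        private final List<SketchRecord> overflow = new ArrayList<>();

        Resolver(DuplicateAction action) { this.action = action; }

        // Returns one Record per Record id, applying the configured action
        // to every additional Record that shares an id already seen.
        List<SketchRecord> resolve(List<SketchRecord> rowRecords) {
            Map<String, SketchRecord> chosen = new LinkedHashMap<>();
            for (SketchRecord r : rowRecords) {
                if (chosen.putIfAbsent(r.recordId, r) == null) {
                    continue; // first Record with this id: keep it
                }
                switch (action) {
                    case CHOSE_ONE:
                        break; // silently drop the extra Record
                    case CHOSE_ONE_AND_WRITE_OVERFLOW:
                        overflow.add(r); // keep the extra Record for inspection
                        break;
                    case ERROR:
                        throw new IllegalStateException("Duplicate Record id: " + r.recordId);
                }
            }
            return new ArrayList<>(chosen.values());
        }

        List<SketchRecord> overflow() { return overflow; }
    }

    public static void main(String[] args) {
        List<SketchRecord> row = Arrays.asList(
            new SketchRecord("r1", "a"), new SketchRecord("r1", "b"), new SketchRecord("r2", "c"));
        Resolver resolver = new Resolver(DuplicateAction.CHOSE_ONE_AND_WRITE_OVERFLOW);
        System.out.println(resolver.resolve(row).size()); // prints 2
        System.out.println(resolver.overflow().size());   // prints 1
    }
}
```

Because Record ordering within a Row is indeterminate, which Record ends up "chosen" under this first-seen rule may still vary between jobs; only the number of kept Records would be stable.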

NOTE: Duplicate Record detection really exists to enforce rules inside of Blur, and a detected duplicate likely means that the inbound index data has not had duplicate Records removed.

People

Assignee: Unassigned

Reporter: Aaron McCurry

Votes: 0

Watchers: 1

Dates

Created: 19/Mar/15 20:12

Updated: 19/Mar/15 20:12