Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
0.2.4
None
Description
Duplicate detection of Records during indexing works as long as the Row does not overflow. If the Row overflows, duplicates are detected only among the buffered Records. In addition, because the ordering of Records within a Row is indeterminate during indexing, the duplicate counts can differ between indexing jobs.
A proposed solution is to allow the user to specify which action to take during indexing:
CHOSE_ONE (Which would choose one of the duplicate Records being indexed)
CHOSE_ONE_AND_WRITE_OVERFLOW (Which would choose one of the duplicate Records being indexed and write the other records out to a known location)
ERROR (Which will cause the job to exit on a Record duplicate)
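To make the three proposed actions concrete, here is a minimal sketch in Java. It is not Blur code: the class DuplicateRecordSketch, the DuplicateAction enum, the Result holder, and the resolve method are all hypothetical names invented for illustration, and Records are reduced to {recordId, payload} string pairs. It also assumes that "choose one" means keeping the first Record seen, which the proposal does not specify, and it only handles Records held in memory, so it does not address the Row overflow case by itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateRecordSketch {

  // Hypothetical enum; the constant names are copied from the proposal above.
  enum DuplicateAction { CHOSE_ONE, CHOSE_ONE_AND_WRITE_OVERFLOW, ERROR }

  // Hypothetical holder for the outcome of resolving one Row.
  static final class Result {
    // Records kept for indexing, keyed by record id, in arrival order.
    final Map<String, String> kept = new LinkedHashMap<>();
    // Duplicates set aside; a real job would write these to a known location.
    final List<String> overflow = new ArrayList<>();
  }

  // Each element of records is a {recordId, payload} pair.
  // Assumption: the first Record seen for an id is the one that is kept.
  static Result resolve(List<String[]> records, DuplicateAction action) {
    Result result = new Result();
    for (String[] record : records) {
      String recordId = record[0];
      String payload = record[1];
      if (!result.kept.containsKey(recordId)) {
        result.kept.put(recordId, payload);
        continue;
      }
      // From here on, the Record is a duplicate of one already kept.
      switch (action) {
        case CHOSE_ONE:
          // Silently drop the duplicate.
          break;
        case CHOSE_ONE_AND_WRITE_OVERFLOW:
          // Keep the first Record and set the duplicate aside.
          result.overflow.add(recordId + "=" + payload);
          break;
        case ERROR:
          // Fail the job on the first duplicate.
          throw new IllegalStateException("Duplicate Record id: " + recordId);
      }
    }
    return result;
  }
}
```

With input Records r1, r2, r1, all three actions see the same duplicate: CHOSE_ONE keeps two Records and discards the third, CHOSE_ONE_AND_WRITE_OVERFLOW keeps two and sets the third aside, and ERROR throws.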
NOTE: Duplicate Record detection is really there to enforce rules inside of Blur, and a detected duplicate likely means that the inbound index data has not had duplicate Records removed.