Apache Blur
BLUR-422

Random duplicate detection during row overflow


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.2.4
    • Fix Version/s: 0.2.4
    • Component/s: Blur MapReduce
    • Labels: None

    Description

      The duplicate detection of Records during indexing works as long as the Row does not overflow. If the Row overflows, duplicate detection works only within the buffered Records. Also, because the ordering of Records within Rows during indexing is indeterminate, duplicate counts can differ between indexing jobs.

      My proposed solution is to allow the user to specify an action to take during indexing:

      CHOSE_ONE (choose one of the duplicate Records being indexed)
      CHOSE_ONE_AND_WRITE_OVERFLOW (choose one of the duplicate Records to index and write the other Records out to a known location)
      ERROR (cause the job to exit on a duplicate Record)
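The three proposed actions could be modeled as a policy applied while collapsing Records by id. The sketch below is illustrative only, under the assumption that Records carry a record id; the enum and the `RecordDeduper` helper are hypothetical names, not part of the Blur API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names: DuplicateAction and RecordDeduper are illustrations
// of the proposal, not actual Blur classes.
enum DuplicateAction { CHOSE_ONE, CHOSE_ONE_AND_WRITE_OVERFLOW, ERROR }

class RecordDeduper {
    // Keeps the first Record seen per record id; routes later duplicates
    // according to the chosen policy. rec[0] = recordId, rec[1] = payload.
    static List<String> dedupe(List<String[]> records, DuplicateAction action,
                               List<String[]> overflow) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] rec : records) {
            String id = rec[0];
            if (!seen.containsKey(id)) {
                seen.put(id, rec);
            } else switch (action) {
                case CHOSE_ONE:
                    break;              // silently drop the duplicate
                case CHOSE_ONE_AND_WRITE_OVERFLOW:
                    overflow.add(rec);  // write duplicate to a known location
                    break;
                case ERROR:
                    // fail the job on a duplicate Record
                    throw new IllegalStateException("Duplicate Record: " + id);
            }
        }
        return new ArrayList<>(seen.keySet());
    }
}
```

Note that this in-memory sketch does not address the overflow problem itself; in a MapReduce job the same policy would have to be applied across the full stream of Records for a Row, not just the buffered portion.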

      NOTE: Duplicate Record detection is really to enforce rules inside of Blur, and a duplicate likely means that duplicate Records have not been removed from the inbound index data.

      Attachments

      Activity

      People

        Assignee: Unassigned
        Reporter: amccurry (Aaron McCurry)
        Votes: 0
        Watchers: 1

      Dates

        Created:
        Updated: