Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5633

Bloom filters underestimate false positive probability

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Perf Investigation
    • Labels:
      None
    • Epic Color:
      ghx-label-1

      Description

      Block Bloom filters have a higher false positive rate than standard Bloom filter, due to the uneven distribution of keys between buckets. We should change the code to match the theory, using an approximation from the paper that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al.

      For a false positive probability of 1%, this would increase filter size by about 10% and a decrease filter false positive probability by 50%. However, this is obscured by the coarseness of the fact that filters are constrained to have a size in bytes that is a power of two. Loosening that restriction is potential future work.

      See https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                jbapple Jim Apple
                Reporter:
                jbapple Jim Apple
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: