Details

Type: Bug

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 0.10.0, 0.11.0

Fix Version/s: 0.12.0

Component/s: Statistics

Labels:None
Description
The current implementation of FlajoletMartin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPCH data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence.
The fix is to use hash functions that are pairwise independent. More on pairwise independence and family of hash functions  http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf