Details

Type: Bug

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 0.10.0, 0.11.0

Fix Version/s: 0.12.0

Component/s: Statistics

Labels:None
Description
The current implementation of FlajoletMartin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPCH data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence.
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
The fix is to use hash functions that are pairwise independent. More on pairwise independence and family of hash functions  http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf