Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
- Impala 4.2.0
- None
Description
The Impala planner selects the desired bloom filter size from an estimate of the NDV (number of distinct values) and a target FPP (false-positive probability, currently 0.75 by default). Starting from IMPALA-11924, the NDV itself is estimated as the minimum of the input cardinality going into the join builder and the column's stats NDV.
If the planner underestimates the input cardinality, it can select a bloom filter size that is too small for the actual NDV of the rows seen during execution, rendering the filter ineffective (it has a high actual false-positive rate). An example of this can be observed in RF004 of Q53 from a TPC-DS 3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (53.txt).
To be specific:
query | filter | column | stats NDV | est cardinality | selected size | actual cardinality | ndv based min size |
---|---|---|---|---|---|---|---|
Q53 | RF004 | i_item_sk | 360000 | 51 | 8KB (2^13) | 18.53K | 128KB (2^17) |
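The sizes in the table can be approximated with a short sketch. It assumes the standard sizing formula for a blocked bloom filter that sets 8 bits per key, with sizes rounded up to a power of two; the function names and the constant `BUCKET_WORDS = 8` are illustrative assumptions, not taken from the Impala source.

```python
import math

BUCKET_WORDS = 8  # assumption: number of bits set per inserted key


def min_log_space(ndv: int, fpp: float) -> int:
    """Return the smallest log2(bytes) whose modeled FPP is <= fpp."""
    if ndv == 0:
        return 0
    k = BUCKET_WORDS
    # Bits needed so that (1 - exp(-k * ndv / bits)) ** k == fpp.
    bits = -k * ndv / math.log(1.0 - fpp ** (1.0 / k))
    return max(0, math.ceil(math.log2(bits / 8)))


def planned_filter_bytes(est_cardinality: int, stats_ndv: int,
                         fpp: float = 0.75, min_bytes: int = 8 * 1024) -> int:
    """Size chosen when NDV is min(cardinality estimate, stats NDV)."""
    ndv = min(est_cardinality, stats_ndv)
    return max(min_bytes, 1 << min_log_space(ndv, fpp))


# Estimated cardinality 51 is clamped up to the 8KB minimum size.
print(planned_filter_bytes(51, 360000))
# Sizing from the stats NDV alone asks for a much larger filter.
print(1 << min_log_space(360000, 0.75))
```

Under these assumptions the estimated cardinality of 51 yields the 8KB minimum, while the stats NDV of 360000 yields 128KB, in line with the "selected size" and "ndv based min size" columns.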
For RF004, the cardinality underestimation can be attributed to a bad selectivity estimate on the build side of the join node that produces the filter. The actual cardinality of 18.53K is still within the limit of an 8KB bloom filter, but because the target FPP is 0.75, the filter still has a high actual false-positive rate and lets more rows through.
Getting a better bloom filter size will require fixing this selectivity estimate, reducing the target FPP below the current default (0.75), or adding an optimization that also considers the stats NDV when the cardinality estimate appears to be severely underestimated. 53_double_filter_size.txt shows that increasing the RF004 size can lead to better row filtering.
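To see why 18.53K distinct values "fit" an 8KB filter yet still filter poorly, the expected false-positive probability can be modeled directly. This is a rough sketch assuming the same 8-bits-per-key blocked bloom filter model; `expected_fpp` is a hypothetical helper, and real measured rates will differ.

```python
import math


def expected_fpp(ndv: int, filter_bytes: int, k: int = 8) -> float:
    """Modeled false-positive probability for ndv keys in filter_bytes."""
    bits = filter_bytes * 8
    return (1.0 - math.exp(-k * ndv / bits)) ** k


# Compare the selected 8KB size against a doubled 16KB filter.
for size_kb in (8, 16):
    fpp = expected_fpp(18530, size_kb * 1024)
    print(f"{size_kb}KB: modeled FPP = {fpp:.3f}")
```

In this model the 8KB filter lands at roughly 0.41, which is under the 0.75 target but still passes a large share of non-matching rows, while doubling the size to 16KB drops the modeled rate to roughly 0.04.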
Attachments
Issue Links
- relates to: IMPALA-6311 Evaluate smaller FPP for Bloom filters (Resolved)