Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
For bloom filter sizing, Impala simply uses the cardinality of the build side, although the estimate could clearly be capped by the NDV of the build-side join key:
https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L661
https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L698
E.g.:
use tpch_parquet;
set RUNTIME_FILTER_MIN_SIZE=8192;
explain select count(*) from orders join customer on o_comment = c_mktsegment;

PLAN-ROOT SINK
|
06:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  row-size=8B cardinality=1
|
05:EXCHANGE [UNPARTITIONED]
|
03:AGGREGATE
|  output: count(*)
|  row-size=8B cardinality=1
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: o_comment = c_mktsegment
|  runtime filters: RF000 <- c_mktsegment
|  row-size=82B cardinality=162.03K
|
|--04:EXCHANGE [BROADCAST]
|  |
|  01:SCAN HDFS [tpch_parquet.customer]
|     HDFS partitions=1/1 files=1 size=12.34MB
|     row-size=21B cardinality=150.00K
|
00:SCAN HDFS [tpch_parquet.orders]
   HDFS partitions=1/1 files=2 size=54.21MB
   runtime filters: RF000 -> o_comment
   row-size=61B cardinality=1.50M
The query above sets RF000's size to 65536, while the minimum of 8192 would be more than enough, as the NDV of c_mktsegment is 5.
The current logic should work well for FK/PK joins, where the build side's cardinality is close to the PK's NDV, but it can massively overestimate the filter size for large build-side tables whose join keys have a small NDV.
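To illustrate the proposed cap, here is a minimal Python sketch (not Impala's actual Java code). The sizing formula (a block Bloom filter with 8 hash functions), the 0.75 target false-positive probability, and the names `min_log_space` and `filter_size_bytes` are all assumptions made for the example. Only the build cardinality (150.00K), the NDV (5), and the 8192 minimum come from the plan above.

```python
import math

BUCKET_WORDS = 8  # assumption: number of hash functions per filter block


def min_log_space(ndv: int, fpp: float) -> int:
    """Return log2 of the bytes needed to hold `ndv` values at rate `fpp`."""
    if ndv <= 0:
        return 0
    # Bits required by the assumed block Bloom filter formula.
    bits = -BUCKET_WORDS * ndv / math.log(1.0 - fpp ** (1.0 / BUCKET_WORDS))
    # Clamp at 0 so that tiny inputs do not yield a negative exponent.
    return max(0, math.ceil(math.log2(bits / 8)))


def filter_size_bytes(build_cardinality: int, key_ndv: int,
                      min_size: int = 8192, fpp: float = 0.75,
                      cap_by_ndv: bool = True) -> int:
    """Size a filter from the build cardinality, optionally capped by NDV."""
    estimate = build_cardinality
    if cap_by_ndv and key_ndv >= 0:
        # The build side cannot contribute more distinct keys than its NDV.
        estimate = min(estimate, key_ndv)
    size = 1 << min_log_space(estimate, fpp)
    return max(min_size, size)


# Values from the plan above: customer has 150.00K rows, c_mktsegment NDV is 5.
uncapped = filter_size_bytes(150_000, 5, cap_by_ndv=False)
capped = filter_size_bytes(150_000, 5, cap_by_ndv=True)
print(uncapped, capped)
```

Under these assumptions the uncapped estimate comes out at 65536, matching the size reported above, while the NDV-capped estimate falls back to the 8192 minimum.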
Issue Links
- is related to IMPALA-9333 Potential runtime filter optimisations (Open)
- relates to IMPALA-6311 Evaluate smaller FPP for Bloom filters (Resolved)