[IMPALA-11924] Bloom filter size is unaffected by column NDV - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 4.3.0
Component/s: Frontend
Labels:
- bloom-filter
- runtime-filters

Epic Color:
ghx-label-10

Description

For bloom filter sizing Impala simply uses the the cardinality of the build side while it could be clearly capped by NDV:
https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L661
https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L698

E.g.:

use tpch_parquet;
set RUNTIME_FILTER_MIN_SIZE=8192
RUNTIME_FILTER_MIN_SIZE
explain select count(*) from orders join customer on o_comment = c_mktsegment
PLAN-ROOT SINK
|
06:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  row-size=8B cardinality=1
|
05:EXCHANGE [UNPARTITIONED]
|
03:AGGREGATE
|  output: count(*)
|  row-size=8B cardinality=1
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: o_comment = c_mktsegment
|  runtime filters: RF000 <- c_mktsegment
|  row-size=82B cardinality=162.03K
|
|--04:EXCHANGE [BROADCAST]
|  |
|  01:SCAN HDFS [tpch_parquet.customer]
|     HDFS partitions=1/1 files=1 size=12.34MB
|     row-size=21B cardinality=150.00K
|
00:SCAN HDFS [tpch_parquet.orders]
   HDFS partitions=1/1 files=2 size=54.21MB
   runtime filters: RF000 -> o_comment
   row-size=61B cardinality=1.50M

The query above set RF000's size to 65536, while the minimum 8192 would be more than enough, as the ndv of c_mktsegment is 5

The current logic should work well for FK/PK joins where the build size's cardinality is close the PK's ndv, but can massively overestimate large tables with small ndv keys.

Attachments

Issue Links

is related to

IMPALA-9333 Potential runtime filter optimisations

Open

relates to

IMPALA-6311 Evaluate smaller FPP for Bloom filters

Resolved

Activity

People

Assignee:: Csaba Ringhofer

Reporter:: Csaba Ringhofer

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 16/Feb/23 15:31

Updated:: 27/Jun/24 00:35

Resolved:: 28/Jul/23 23:40