Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 3.1.0
Description
We can improve the performance of some joins by pre-filtering one side of the join with a Bloom filter and an IN predicate, both generated from the values on the other side of the join.
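The idea can be illustrated with a minimal, self-contained sketch in plain Python. This is not Spark's implementation: the `BloomFilter` class, the `bloom_filter_join` function, and the sample `orders`/`customers` rows are all hypothetical names chosen for illustration, and only the Bloom filter half of the optimization is shown (the IN predicate is omitted). The filter is built from the join keys of the small side and then used to discard rows of the large side that cannot possibly match before the join runs.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))


def bloom_filter_join(large_rows, small_rows, key):
    """Inner join that pre-filters large_rows with a Bloom filter."""
    bloom = BloomFilter()
    for row in small_rows:
        bloom.add(row[key])

    # Pre-filter: keep only rows whose key might exist on the small side.
    candidates = [row for row in large_rows if bloom.might_contain(row[key])]

    # Ordinary hash join on the (much smaller) candidate set.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = [{**l, **s} for l in candidates for s in lookup.get(l[key], [])]
    return joined, len(candidates)


# Hypothetical sample data: 1000 orders, only two customers of interest.
orders = [{"order_id": i, "cust_id": i % 100} for i in range(1000)]
customers = [{"cust_id": 7, "name": "a"}, {"cust_id": 42, "name": "b"}]

joined, num_candidates = bloom_filter_join(orders, customers, "cust_id")
print(len(orders), "rows before pre-filtering,", num_candidates, "after")
```

Because a Bloom filter can return false positives but never false negatives, the pre-filter may let a few non-matching rows through, but it never drops a row that belongs in the join result; the final join removes any false positives.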
For example: tpcds/q16.sql. [Figure: before this optimization] [Figure: after this optimization]
Query Performance Benchmarks: TPC-DS Performance Evaluation
Our setup for running the TPC-DS benchmark was as follows: TPC-DS at 5 TB scale, stored as partitioned Parquet tables.
Query | Default (seconds) | Bloom filter join enabled (seconds) |
tpcds q16 | 84 | 46 |
tpcds q36 | 29 | 21 |
tpcds q57 | 39 | 28 |
tpcds q94 | 42 | 34 |
tpcds q95 | 306 | 288 |
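To make the relative gains easier to compare, the short sketch below derives per-query speedups from the timings in the table. The numbers are copied from the table; the variable names are illustrative only.

```python
# (default seconds, Bloom filter join seconds), copied from the table above.
timings = {
    "tpcds q16": (84, 46),
    "tpcds q36": (29, 21),
    "tpcds q57": (39, 28),
    "tpcds q94": (42, 34),
    "tpcds q95": (306, 288),
}

# Speedup = default runtime divided by runtime with the optimization enabled.
speedups = {q: before / after for q, (before, after) in timings.items()}
for query, speedup in speedups.items():
    print(f"{query}: {speedup:.2f}x")

# Totals across the five listed queries.
total_before = sum(before for before, _ in timings.values())
total_after = sum(after for _, after in timings.values())
print(f"total: {total_before}s -> {total_after}s")
```

The largest relative gain is on q16, while q95 improves the least in relative terms even though it is the longest-running query in the table.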
Attachments
Issue Links
- is related to
  - SPARK-39386 Flaky Test: BloomFilterAggregateQuerySuite (Open)
  - SPARK-34562 Leverage parquet bloom filters (Resolved)
  - SPARK-42628 Add a migration note for bloom filter join (Resolved)
- relates to
  - SPARK-38841 Enable Bloom filter join by default (Resolved)
- links to