[SPARK-27227] Spark Runtime Filter - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

When we equi-join one big table with a smaller table, we can collect some statistics from the smaller table side, and use it to the scan of big table to do partition prune or data filter before execute the join.
This can significantly improve SQL perfermance.

For a simple example:
select * from A, B where A.a = B.b
A is big table ,B is small table.

There are two scenarios:
1. A.a is a partition column of table A
we can collect all the values of B.b, and send it to table A to do
partition prune on A.a.
2. A.a is not a partition column of table A
we can collect real-time some statistics(such as min/max/bloomfilter) of B.b by execute extra sql(select max(b),min(b),bbf(b) from B), and send it to table A to do filter on A.a.
Addititionaly, if a more complex query select * from A join (select * from B where B.c = 1) X on A.a = B.b, then we collect real-time statistics(such as min/max/bloomfilter) of X by execute extra sql(select max(b),min(b),bbf(b) from X)

Above two scenarios, we can filter out lots of data by partition prune or data filter, thus we can imporve perfermance.

10TB TPC-DS gain about 35% improvement in our test.

I will submit a SPIP later.

SPIP: https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt

Attachments

Issue Links

is related to

SPARK-28888 Implement Dynamic Partition Pruning

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Song Jun

Votes:: 5 Vote for this issue

Watchers:: 14 Start watching this issue

Dates

Created:: 21/Mar/19 12:12

Updated:: 16/Mar/20 22:50