Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.1.0
-
None
Description
For broadcast inner-joins, where the smaller relation is known to be small enough to materialize on a worker, the set of values for all join columns is known and fits in memory. Spark should translate these values into a Filter pushed down to the datasource. The common join condition of equality, i.e. lhs.a == rhs.a, can be written as an a in ... clause. An example of pushing such filters is already present in the form of IsNotNull filters via sameerag's work on SPARK-12957 subtasks.
This optimization could even work when the smaller relation does not fit entirely in memory. This could be done by partitioning the smaller relation into N pieces, applying this predicate pushdown for each piece, and unioning the results.
Attachments
Issue Links
- is superceded by
-
SPARK-39753 Broadcast joins should pushdown join constraints as Filter to the larger relation
- Open
- relates to
-
SPARK-12957 Derive and propagate data constrains in logical plan
- Resolved