Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version: 2.0.0
- Fix Version: None
Description
For bucketed tables, Spark avoids the `Sort` and `Exchange` steps if the input tables and the output table have the same number of buckets. However, unequal bucketing always leads to `Sort` and `Exchange`. If the number of buckets in the output table is a factor of the number of buckets in each input table, we should be able to avoid `Sort` and `Exchange` and directly join them.
E.g., assume Input1, Input2, and Output are bucketed + sorted tables over the same columns but with different numbers of buckets: Input1 has 8 buckets, Input2 has 12 buckets, and Output has 4 buckets. Since hash-partitioning assigns rows by modulus, joining buckets (0, 4) of Input1 with buckets (0, 4, 8) of Input2 in the same task would produce bucket 0 of the output table.
Output bucket   Input1 buckets (of 8)   Input2 buckets (of 12)
0               0, 4                    0, 4, 8
1               1, 5                    1, 5, 9
2               2, 6                    2, 6, 10
3               3, 7                    3, 7, 11
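The grouping above follows directly from modulo arithmetic: a row in input bucket b satisfies hash(key) % num_in == b, so when the output bucket count divides the input's, that row lands in output bucket b % num_out. A minimal sketch of this mapping (the helper `coalesced_buckets` is hypothetical for illustration, not a Spark API):

```python
def coalesced_buckets(num_in: int, num_out: int) -> dict:
    """Map each output bucket to the input buckets that fold into it
    when rows are assigned by hash(key) % num_buckets."""
    assert num_in % num_out == 0, "output bucket count must divide input's"
    groups = {b: [] for b in range(num_out)}
    for b in range(num_in):
        # hash(key) % num_in == b  implies  hash(key) % num_out == b % num_out
        # whenever num_out divides num_in.
        groups[b % num_out].append(b)
    return groups

print(coalesced_buckets(8, 4))   # Input1: {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(coalesced_buckets(12, 4))  # Input2: {0: [0, 4, 8], 1: [1, 5, 9], ...}
```

A single task can therefore read group b from each input and emit output bucket b, with no shuffle needed.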
Issue Links
- is related to: SPARK-23839 consider bucket join in cost-based JoinReorder rule (Resolved)
- relates to: SPARK-24025 Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side (Resolved)