SPARK-17570: Avoid Hash and Exchange in Sort Merge Join if the bucketing factor is a multiple across tables

Parent: SPARK-19256 Hive bucketing write support


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In the case of bucketed tables, Spark avoids the `Sort` and `Exchange` steps if the input tables and the output table have the same number of buckets. Unequal bucket counts, however, always lead to a `Sort` and `Exchange`. If the number of buckets in the output table is a factor of the number of buckets in the input tables, we should be able to avoid the `Sort` and `Exchange` and join the matching buckets directly (as sketched below).
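
      For reference, a minimal sketch of the case Spark already optimizes, with both sides bucketed and sorted into the same number of buckets on the join key (table names, data and join key here are hypothetical):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("bucketed-smj-sketch").getOrCreate()

      // Two tables bucketed + sorted into the SAME number of buckets (8) on the join key.
      spark.range(1000000).selectExpr("id % 100 AS key", "id AS v1")
        .write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("input1")
      spark.range(1000000).selectExpr("id % 100 AS key", "id AS v2")
        .write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("input2")

      // With matching bucket counts the plan is a SortMergeJoin with no Exchange on
      // either side (and, when each bucket holds one sorted file, no Sort either).
      spark.table("input1").join(spark.table("input2"), "key").explain()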

      E.g. assume Input1, Input2 and Output are bucketed + sorted tables over the same columns but with different numbers of buckets: Input1 has 8 buckets, Input2 has 12 buckets and Output has 4 buckets. Since hash-partitioning is done using a modulus, joining buckets (0, 4) of Input1 with buckets (0, 4, 8) of Input2 in the same task yields bucket 0 of the output table:

      Input1   (0, 4)      (1, 5)      (2, 6)       (3, 7)
      Input2   (0, 4, 8)   (1, 5, 9)   (2, 6, 10)   (3, 7, 11)
      Output   (0)         (1)         (2)          (3)
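
      The factor relationship is what makes this safe: bucket ids are assigned by hashing the key and taking a modulus, and because 4 divides both 8 and 12, a row in input bucket b can only belong to output bucket b mod 4. A small sketch that reproduces the grouping above (Input2 is assumed to have 12 buckets, as its bucket triples imply):

      // Which input buckets co-group into each output bucket under modulus bucketing.
      val input1Buckets = 8
      val input2Buckets = 12
      val outputBuckets = 4

      for (out <- 0 until outputBuckets) {
        val g1 = (0 until input1Buckets).filter(_ % outputBuckets == out).mkString("(", ", ", ")")
        val g2 = (0 until input2Buckets).filter(_ % outputBuckets == out).mkString("(", ", ", ")")
        println(s"output bucket $out  <-  Input1 $g1  +  Input2 $g2")
      }
      // output bucket 0  <-  Input1 (0, 4)  +  Input2 (0, 4, 8)
      // output bucket 1  <-  Input1 (1, 5)  +  Input2 (1, 5, 9)
      // ...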
      


People

    • Assignee: Unassigned
    • Reporter: Tejas Patil (tejasp)
    • Votes: 6
    • Watchers: 19
