Description
Currently Spark has HashClusteredDistribution and ClusteredDistribution. The only difference between the two is that the former is stricter when deciding whether a bucketed join can avoid a shuffle: compared to the latter, it requires an exact match between the clustering keys from the output partitioning (i.e., HashPartitioning) and the join keys. However, this is unnecessary: we should be able to avoid the shuffle when the set of clustering keys is a subset of the join keys, just as ClusteredDistribution allows. The sketch below illustrates the difference between the two checks.
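The following is a minimal, self-contained sketch of the two satisfaction checks, not Spark's actual implementation: the `Expr` alias, the `satisfiesClustered`/`satisfiesHashClustered` method names, and the simplified equality are stand-ins for the real Catalyst expressions and `Partitioning.satisfies` machinery.

```scala
// Simplified model (hypothetical names) of how HashPartitioning decides
// whether it satisfies each distribution.
object DistributionSketch {
  type Expr = String // stand-in for a Catalyst Expression

  case class HashPartitioning(expressions: Seq[Expr]) {
    // ClusteredDistribution-style check: satisfied if every partitioning
    // key appears among the required clustering (join) keys, i.e. the
    // partitioning keys form a subset of the join keys.
    def satisfiesClustered(required: Seq[Expr]): Boolean =
      expressions.forall(required.contains)

    // HashClusteredDistribution-style check: satisfied only on an exact,
    // ordered match between partitioning keys and the required keys.
    def satisfiesHashClustered(required: Seq[Expr]): Boolean =
      expressions.length == required.length &&
        expressions.zip(required).forall { case (l, r) => l == r }
  }

  def main(args: Array[String]): Unit = {
    val partitioning = HashPartitioning(Seq("a")) // table bucketed by `a`
    val joinKeys     = Seq("a", "b")              // join on `a` and `b`

    println(partitioning.satisfiesClustered(joinKeys))     // true: shuffle avoidable
    println(partitioning.satisfiesHashClustered(joinKeys)) // false: forces a shuffle
  }
}
```

With a table bucketed on `a` and a join on both `a` and `b`, the subset check succeeds while the exact-match check fails, which is precisely the unnecessary shuffle this issue proposes to eliminate.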
Issue Links
- causes SPARK-40703 Performance regression for joins in Spark 3.3 vs Spark 3.2 (Resolved)
- is depended upon by SPARK-37375 Umbrella: Storage Partitioned Join (SPJ) (Resolved)