Description
Currently Spark has HashClusteredDistribution and ClusteredDistribution. The only difference between the two is that the former is stricter when deciding whether a bucketed join can avoid a shuffle: compared to the latter, it requires an exact match between the clustering keys from the output partitioning (i.e., HashPartitioning) and the join keys. However, this is unnecessary: we should be able to avoid the shuffle when the set of clustering keys is a subset of the join keys, just as ClusteredDistribution allows. The sketch below illustrates the difference between the two checks.
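The following is a minimal, self-contained sketch of the two satisfaction checks, not Spark's actual implementation: the `Expr` alias, the `satisfiesClustered`/`satisfiesHashClustered` method names, and the simplified equality are stand-ins for the real Catalyst expressions and `Partitioning.satisfies` machinery.

```scala
// Simplified model (hypothetical names) of how HashPartitioning decides
// whether it satisfies each distribution.
object DistributionSketch {
  type Expr = String // stand-in for a Catalyst Expression

  case class HashPartitioning(expressions: Seq[Expr]) {
    // ClusteredDistribution-style check: satisfied if every partitioning
    // key appears among the required clustering (join) keys, i.e. the
    // partitioning keys form a subset of the join keys.
    def satisfiesClustered(required: Seq[Expr]): Boolean =
      expressions.forall(required.contains)

    // HashClusteredDistribution-style check: satisfied only on an exact,
    // ordered match between partitioning keys and the required keys.
    def satisfiesHashClustered(required: Seq[Expr]): Boolean =
      expressions.length == required.length &&
        expressions.zip(required).forall { case (l, r) => l == r }
  }

  def main(args: Array[String]): Unit = {
    val partitioning = HashPartitioning(Seq("a")) // table bucketed by `a`
    val joinKeys     = Seq("a", "b")              // join on `a` and `b`

    println(partitioning.satisfiesClustered(joinKeys))     // true: shuffle avoidable
    println(partitioning.satisfiesHashClustered(joinKeys)) // false: forces a shuffle
  }
}
```

With a table bucketed on `a` and a join on both `a` and `b`, the subset check succeeds while the exact-match check fails, which is precisely the unnecessary shuffle this issue proposes to eliminate.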
Issue Links
- causes SPARK-40703 Performance regression for joins in Spark 3.3 vs Spark 3.2 (Resolved)
- is depended upon by SPARK-37375 Umbrella: Storage Partitioned Join (SPJ) (Resolved)