[SPARK-42038] SPJ: Support partially clustered distribution - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.1
Fix Version/s: 3.4.0
Component/s: SQL
Labels: None

Description

Currently, storage-partitioned join (SPJ) requires both sides of the join to be fully clustered on the partition values: all input partitions reported by a V2 data source must be grouped by partition value before the join happens. This can lead to data skew if a particular partition value is associated with a large number of rows.

To combat this, we can introduce partially clustered distribution, in which only one side of the join is required to be fully clustered while the other side is not. This allows Spark to increase the parallelism of the join and mitigate the data skew.
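The idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept and not Spark's implementation: the names `fully_clustered`, `plan_join_tasks`, and `join_rows` are hypothetical, a split is modeled as a `(partition_value, rows)` pair, and the sketch assumes that each unclustered split on one side is simply paired with the whole group for the same partition value on the other side.

```python
from collections import defaultdict


def fully_clustered(splits):
    """Group all input splits by partition value: one group per value."""
    groups = defaultdict(list)
    for value, rows in splits:
        groups[value].extend(rows)
    return dict(groups)


def plan_join_tasks(left_splits, right_splits, partially_clustered):
    """Return join tasks as (partition_value, left_rows, right_rows).

    The right side is always fully clustered. The left side is either
    fully clustered too, or (partially clustered) left as separate splits.
    """
    right = fully_clustered(right_splits)
    if partially_clustered:
        # Keep every left split as its own task.
        left_tasks = [(value, list(rows)) for value, rows in left_splits]
    else:
        # Merge all left splits sharing a partition value into one task.
        left_tasks = list(fully_clustered(left_splits).items())
    # Pair each left task with the complete right group for the same value.
    return [(v, rows, right[v]) for v, rows in left_tasks if v in right]


def join_rows(tasks):
    """Inner-join output of all tasks, sorted so plans can be compared."""
    return sorted((l, r) for _, lrows, rrows in tasks
                  for l in lrows for r in rrows)


# Partition value "a" is skewed: it arrives as three splits on the left.
left_splits = [("a", [1, 2, 3]), ("a", [4, 5, 6]), ("a", [7, 8]), ("b", [9])]
right_splits = [("a", ["x"]), ("b", ["y"])]

full = plan_join_tasks(left_splits, right_splits, partially_clustered=False)
partial = plan_join_tasks(left_splits, right_splits, partially_clustered=True)

print(len(full), max(len(rows) for _, rows, _ in full))
print(len(partial), max(len(rows) for _, rows, _ in partial))
```

In this toy input, the fully clustered plan yields 2 tasks with 8 left rows in the largest one, while the partially clustered plan yields 4 tasks with at most 3 left rows each. Both plans produce the same joined rows. The trade-off in this model is that the right-side group for "a" is read by three tasks instead of one.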

Issue Links

links to

[Github] Pull Request #39633 (sunchao)

People

Assignee: Chao Sun

Reporter: Chao Sun

Votes: 0

Watchers: 3

Dates

Created: 13/Jan/23 00:20

Updated: 07/Feb/23 05:20

Resolved: 07/Feb/23 05:20