[SPARK-38237] Introduce a new config to require all cluster keys on Aggregate - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.3.0
Component/s: SQL, Structured Streaming
Labels: None

Description

We still find HashClusteredDistribution useful for batch queries as well. For example, we had a case where parallelism was lower than expected because aggregation uses ClusteredDistribution, which is satisfied by a HashPartitioning on only a subset of the grouping keys. (Note that the effective parallelism also depends on cardinality: partitioning on a subset of the keys means fewer distinct values to spread across partitions.)
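The cardinality effect can be illustrated with a small, Spark-free sketch. Everything here is an assumption made for illustration: `simple_hash` is a toy stand-in for Spark's real partitioning hash, and the `country`/`user` columns, the row counts, and the partition count are hypothetical.

```python
def simple_hash(values):
    """Toy polynomial hash over integer values (NOT Spark's actual hash)."""
    h = 0
    for v in values:
        h = h * 31 + v
    return h


def partitions_used(rows, key_columns, num_partitions):
    """Count how many partitions receive at least one row when rows are
    hash-partitioned on the given key columns."""
    targets = set()
    for row in rows:
        key = [row[c] for c in key_columns]
        targets.add(simple_hash(key) % num_partitions)
    return len(targets)


# Hypothetical data: 3 countries x 1000 users, grouped by (country, user).
rows = [{"country": c, "user": u} for c in range(3) for u in range(1000)]
num_partitions = 16

# Partitioning on a subset of the grouping keys: bounded by its cardinality.
sub_key = partitions_used(rows, ["country"], num_partitions)

# Partitioning on all grouping keys: many more distinct values to spread.
all_keys = partitions_used(rows, ["country", "user"], num_partitions)

print(sub_key, all_keys)
```

With only three distinct `country` values, at most three of the sixteen partitions can ever hold data, no matter how many rows there are; partitioning on the full key does not have that cap.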

We propose to introduce a new config that requires all cluster keys on Aggregate, leveraging HashClusteredDistribution. To that end, we propose to rename the distribution back to HashClusteredDistribution while retaining the NOTE for stateful operators. The distribution itself still must not be changed, because stateful operators depend on it, but it can be shared with the batch case where needed.
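The difference between the two requirements can be modelled roughly as follows. This is a simplified sketch, not Spark's implementation: the function names are hypothetical, keys are plain strings rather than expressions, and the checks only approximate how a hash partitioning is matched against each kind of distribution.

```python
def satisfies_clustered(partition_keys, cluster_keys):
    """Relaxed check: a hash partitioning is acceptable if every key it
    partitions on appears among the required cluster keys (a subset is enough)."""
    return len(partition_keys) > 0 and all(k in cluster_keys for k in partition_keys)


def satisfies_all_cluster_keys(partition_keys, cluster_keys):
    """Strict check: the partitioning must use exactly the required cluster
    keys, in the same order."""
    return list(partition_keys) == list(cluster_keys)


grouping_keys = ["a", "b"]          # hypothetical GROUP BY a, b
existing_partitioning = ["a"]       # child is already hash-partitioned on a

# Relaxed check passes, so no shuffle is added and parallelism follows `a`.
relaxed_ok = satisfies_clustered(existing_partitioning, grouping_keys)

# Strict check fails, so a shuffle on (a, b) would be required.
strict_ok = satisfies_all_cluster_keys(existing_partitioning, grouping_keys)

print(relaxed_ok, strict_ok)
```

Under the strict check, the aggregate would repartition on the full grouping key instead of reusing the narrower partitioning, trading an extra shuffle for better parallelism.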

Issue Links

links to

[Github] Pull Request #35551 (HeartSaVioR)

[Github] Pull Request #35552 (HeartSaVioR)

[Github] Pull Request #35574 (c21)

People

Assignee: Cheng Su

Reporter: Jungtaek Lim

Votes: 0

Watchers: 3

Dates

Created: 17/Feb/22 09:01

Updated: 25/Feb/22 22:49

Resolved: 25/Feb/22 22:48