[SPARK-33574] Improve locality for push-based shuffle, especially for join-like operations - ASF JIRA

Details

Type: Sub-task
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Shuffle, Spark Core
Labels: None

Description

Currently, we only set locality for ShuffledRDD and ShuffledRowRDD with push-based shuffle.

In simple stage DAGs where a ShuffledRDD or ShuffledRowRDD is the only input RDD, Spark handles locality well. However, in a join-like operation, where a stage can consume multiple shuffle inputs or a mix of shuffle and non-shuffle inputs, locality degrades because of how DAGScheduler currently determines the preferred locations.

With push-based shuffle, we could potentially reuse the same set of merger locations across sibling ShuffleMapStages. This would enable much better locality on the reducer stage side, because the corresponding merged shuffle partitions of the multiple shuffle inputs would already be colocated.
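
To make the expected benefit concrete, here is a toy model in Python. It is not Spark's DAGScheduler logic: the function names, the round-robin mapping from merged partition to merger host, and the "place the task on the host holding the most bytes" rule are all assumptions made for illustration. It compares the fraction of merged bytes a reducer stage could read locally when two sibling shuffle inputs use independent merger locations versus a shared set.

```python
from typing import Dict, List


def merged_partition_host(mergers: List[str], partition: int) -> str:
    """Toy assumption: merged partition p lives on mergers[p % len(mergers)]."""
    return mergers[partition % len(mergers)]


def local_fraction(
    merger_lists: List[List[str]],
    bytes_per_partition: List[int],
    num_partitions: int,
) -> float:
    """Fraction of merged shuffle bytes read locally by the reducer stage.

    Each reducer task is placed on the host that holds the most merged
    bytes for its partition, summed over all shuffle inputs (a simplified
    stand-in for a preferred-location decision).
    """
    local = 0
    total = 0
    for p in range(num_partitions):
        bytes_by_host: Dict[str, int] = {}
        for mergers, size in zip(merger_lists, bytes_per_partition):
            host = merged_partition_host(mergers, p)
            bytes_by_host[host] = bytes_by_host.get(host, 0) + size
        local += max(bytes_by_host.values())
        total += sum(bytes_by_host.values())
    return local / total


# Two sibling shuffle inputs of a join, 4 reducer partitions,
# 100 bytes per merged partition on each side (hypothetical numbers).
independent = local_fraction([["a", "b"], ["c", "d"]], [100, 100], 4)
shared = local_fraction([["a", "b"], ["a", "b"]], [100, 100], 4)
print(independent)  # 0.5: each task is local to only one of its two inputs
print(shared)       # 1.0: both inputs for a partition sit on the same host
```

In this model, sharing merger locations raises the locally readable fraction from half of the input to all of it. Real gains would depend on how merged partitions are actually assigned to mergers and on the relative sizes of the inputs.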

Issue Links

links to

[Github] Pull Request #34500 (rmcyang)

People

Assignee: Unassigned
Reporter: Min Shen
Votes: 0
Watchers: 7

Dates

Created: 26/Nov/20 18:07
Updated: 06/Nov/21 05:37