[SPARK-42689] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.0
Fix Version/s: 3.5.0
Component/s: Spark Core
Labels: None

Description

Currently, when an executor node is lost, Spark assumes the shuffle data on that node is lost as well. That is not necessarily true when a shuffle component manages the shuffle data and stores it reliably (for example, in a distributed filesystem or in a disaggregated shuffle cluster).

Downstream projects carry patches to Apache Spark to work around this issue; Apache Celeborn, for example, has one.
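The idea can be sketched roughly as follows. This is an illustration only, not the code from the pull request: `DriverComponent`, `storesShuffleDataReliably`, and `OutputTracker` are hypothetical stand-ins for the driver-side shuffle component and the map output bookkeeping, and none of these names are taken from Spark's API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not one of Spark's classes.
final class ReliableShuffleSketch {

    /** Hypothetical stand-in for a driver-side shuffle plugin component. */
    interface DriverComponent {
        // Defaults to false, which preserves the existing assumption that
        // shuffle output lives (and dies) with the executor's node.
        default boolean storesShuffleDataReliably() {
            return false;
        }
    }

    /** Hypothetical tracker that records which executor wrote each map output. */
    static final class OutputTracker {
        private final Map<Integer, String> outputToExecutor = new HashMap<>();
        private final DriverComponent component;

        OutputTracker(DriverComponent component) {
            this.component = component;
        }

        void register(int mapId, String executorId) {
            outputToExecutor.put(mapId, executorId);
        }

        /** Handles an executor loss and returns how many map outputs were invalidated. */
        int onExecutorLost(String executorId) {
            if (component.storesShuffleDataReliably()) {
                // The shuffle data is kept outside the lost node, so nothing is invalidated.
                return 0;
            }
            int before = outputToExecutor.size();
            // Drop every output written by the lost executor; it would need recomputation.
            outputToExecutor.values().removeIf(executorId::equals);
            return before - outputToExecutor.size();
        }

        int registeredOutputs() {
            return outputToExecutor.size();
        }
    }
}
```

With the default component, losing an executor invalidates its registered outputs; with a component that overrides the method to return `true`, the outputs stay registered and no recomputation is triggered.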

Issue Links

links to

[Github] Pull Request #40307 (mridulm)

People

Assignee: Mridul Muralidharan

Reporter: Mridul Muralidharan

Votes: 0

Watchers: 3

Dates

Created: 06/Mar/23 19:24

Updated: 09/Mar/23 07:10

Resolved: 09/Mar/23 07:10