Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
3.5.0
None
Description
Currently, when an executor node is lost, we assume the shuffle data on that node is lost as well. This is not necessarily the case if a shuffle component manages the shuffle data and stores it reliably (for example, in a distributed filesystem or in a disaggregated shuffle cluster).
Downstream projects carry patches to Apache Spark to work around this issue; for example, Apache Celeborn has such a patch.
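To make the behaviour described above concrete, here is a minimal toy model in Python. It is not Spark code: `ShuffleComponent`, `MapOutputRegistry`, the `reliable_storage` flag, and `on_executor_node_lost` are all hypothetical names invented for this sketch. It only illustrates the idea that map outputs need to be invalidated on node loss when the shuffle component does not store them reliably.

```python
from dataclasses import dataclass, field


@dataclass
class ShuffleComponent:
    """Hypothetical stand-in for whatever manages shuffle data."""
    name: str
    reliable_storage: bool  # assumption: True means data survives executor loss


@dataclass
class MapOutputRegistry:
    """Toy registry tracking which host produced each map output."""
    outputs: dict = field(default_factory=dict)  # map_id -> host

    def register(self, map_id: int, host: str) -> None:
        self.outputs[map_id] = host

    def on_executor_node_lost(self, host: str, component: ShuffleComponent) -> list:
        """Return the map ids that must be recomputed after `host` is lost."""
        if component.reliable_storage:
            # Shuffle data lives outside the lost node, so nothing is invalidated.
            return []
        # Otherwise assume every output written on that host is gone.
        lost = sorted(m for m, h in self.outputs.items() if h == host)
        for m in lost:
            del self.outputs[m]
        return lost


registry = MapOutputRegistry()
registry.register(0, "host-a")
registry.register(1, "host-b")
registry.register(2, "host-a")

remote = ShuffleComponent("remote-shuffle", reliable_storage=True)
local = ShuffleComponent("local-disk", reliable_storage=False)

kept = registry.on_executor_node_lost("host-a", remote)
dropped = registry.on_executor_node_lost("host-a", local)
```

With the reliable component, losing `host-a` invalidates nothing; with the local-disk component, the same event drops both outputs written on that host and they would have to be recomputed.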