Details
- Improvement
- Status: Resolved
- Major
- Resolution: Incomplete
- 2.0.0
- None
Description
This is intended as a complement to SPARK-17370, which addressed Standalone mode only.
For Mesos, it seems we could enhance MesosExternalShuffleClient to detect, when sending heartbeats, whether any of the external shuffle services has been lost. In that case, MesosCoarseGrainedSchedulerBackend can report ExecutorLost with workerLost=true. It can also add the slave on which the lost external shuffle service was running to the blacklist, so that no further tasks are launched on it.
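The proposed flow can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Spark code: the class `ShuffleServiceMonitor`, its methods, and the timeout parameter are hypothetical names invented for illustration, and the real change would live in MesosExternalShuffleClient and MesosCoarseGrainedSchedulerBackend. It shows the three steps described above: noticing that a shuffle service has stopped answering heartbeats, reporting each affected executor as lost with the worker-lost flag set, and blacklisting the host.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ShuffleServiceMonitor:
    """Toy driver-side tracker of external shuffle services (hypothetical, not Spark API)."""

    # Assumed parameter: how long a service may stay silent before it is declared lost.
    heartbeat_timeout_s: float
    # Called as on_executor_lost(executor_id, worker_lost) for each affected executor.
    on_executor_lost: Callable[[str, bool], None]
    # host -> time of the last successful heartbeat to its shuffle service
    last_ok: Dict[str, float] = field(default_factory=dict)
    # host -> ids of the executors running on that host
    executors: Dict[str, List[str]] = field(default_factory=dict)
    # hosts on which no further tasks should be launched
    blacklist: Set[str] = field(default_factory=set)

    def register(self, host: str, executor_id: str, now: float) -> None:
        """Start tracking an executor and the shuffle service on its host."""
        self.executors.setdefault(host, []).append(executor_id)
        self.last_ok.setdefault(host, now)

    def heartbeat_succeeded(self, host: str, now: float) -> None:
        """Record that the shuffle service on a tracked host answered a heartbeat."""
        if host in self.last_ok:
            self.last_ok[host] = now

    def check(self, now: float) -> List[str]:
        """Declare lost every host whose shuffle service has been silent too long."""
        lost = [h for h, t in self.last_ok.items()
                if now - t > self.heartbeat_timeout_s]
        for host in lost:
            del self.last_ok[host]
            self.blacklist.add(host)
            # worker_lost=True signals that shuffle files on the host are gone too.
            for executor_id in self.executors.pop(host, []):
                self.on_executor_lost(executor_id, True)
        return lost

    def can_launch_on(self, host: str) -> bool:
        """Tasks may only be launched on hosts that are not blacklisted."""
        return host not in self.blacklist
```

In this model a caller would invoke `check` periodically, for example after each heartbeat round, and consult `can_launch_on` before accepting a resource offer for a host. Passing the worker-lost flag through the callback is the important part: it is what would let the scheduler invalidate shuffle outputs stored on that host instead of treating the event as an ordinary executor exit.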
Issue Links
- is related to
  - SPARK-17370 Shuffle service files not invalidated when a slave is lost (Resolved)
  - SPARK-20832 Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs (Resolved)