Spark / SPARK-17519

[MESOS] Enhance robustness when ExternalShuffleService is broken

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Mesos

    Description

      This is intended to complement SPARK-17370, which addressed Standalone mode only.
      For Mesos, it seems we could enhance MesosExternalShuffleClient to detect, when sending heartbeats, whether any of the external shuffle services has been lost. In that case, MesosCoarseGrainedSchedulerBackend can report ExecutorLost with workerLost=true. It can also add the slave on which that external shuffle service ran to the blacklist, so that no further tasks are launched on it.
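The proposal above can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions, not Spark code: the names `ShuffleServiceMonitor`, `ExecutorLost`, `heartbeat_acked`, `check`, and the timeout value are hypothetical, and time is passed in explicitly rather than read from a clock. It shows the three steps in the description: noticing a host whose shuffle service stopped acknowledging heartbeats, reporting its executors as lost with `worker_lost=True`, and blacklisting the host.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorLost:
    """Hypothetical event, loosely modelled on the ExecutorLost notification."""
    executor_id: str
    host: str
    worker_lost: bool


@dataclass
class ShuffleServiceMonitor:
    """Tracks heartbeat acknowledgements from the shuffle service on each host."""
    timeout_s: float
    last_ack: dict = field(default_factory=dict)   # host -> time of last ack
    executors: dict = field(default_factory=dict)  # host -> set of executor ids
    blacklist: set = field(default_factory=set)    # hosts excluded from launches

    def register_executor(self, executor_id, host, now):
        self.executors.setdefault(host, set()).add(executor_id)
        self.last_ack.setdefault(host, now)

    def heartbeat_acked(self, host, now):
        # Acks from an already blacklisted host are ignored in this sketch.
        if host not in self.blacklist:
            self.last_ack[host] = now

    def check(self, now):
        """Return ExecutorLost events for hosts that exceeded the timeout."""
        events = []
        for host, seen in list(self.last_ack.items()):
            if now - seen > self.timeout_s:
                self.blacklist.add(host)
                del self.last_ack[host]
                for executor_id in sorted(self.executors.pop(host, ())):
                    events.append(ExecutorLost(executor_id, host, worker_lost=True))
        return events

    def can_launch_on(self, host):
        return host not in self.blacklist


# Usage: agent-b never acknowledges a heartbeat after registration.
monitor = ShuffleServiceMonitor(timeout_s=30.0)
monitor.register_executor("exec-1", "agent-a", now=0.0)
monitor.register_executor("exec-2", "agent-b", now=0.0)
monitor.heartbeat_acked("agent-a", now=25.0)
events = monitor.check(now=40.0)
```

Reporting the loss with `worker_lost=True` matters because shuffle output stored on that host can no longer be served, so a scheduler receiving the event would need to treat that output as unavailable rather than only rescheduling the running tasks.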

            People

              Assignee: Unassigned
              Reporter: Sun Rui (sunrui)
              Votes: 1
              Watchers: 3
