SPARK-16554: Spark should kill executors when they are blacklisted


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      SPARK-8425 will allow blacklisting faulty executors and nodes. However, these blacklisted executors will continue to run. This is bad for a few reasons:

      (1) Even with faulty hardware, if the cluster is under-utilized, Spark may be able to request another executor on a different node.

      (2) If there is a faulty disk (the most common case of faulty hardware), the cluster manager may be able to allocate another executor on the same node, provided it can exclude the bad disk. (YARN will do this with its disk-health checker.)

      With dynamic allocation this may seem less critical, since a blacklisted executor will stop running new tasks and eventually be reclaimed. However, if those executors hold cached data, they will not be killed until spark.dynamicAllocation.cachedExecutorIdleTimeout expires, which is effectively infinite by default.
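
      As a rough illustration of that interaction, the sketch below (timeout values are illustrative, and dynamic allocation also requires the external shuffle service) reclaims merely idle executors quickly, but leaves any executor holding cached blocks, including a blacklisted one, alive indefinitely:

      {code:scala}
      import org.apache.spark.{SparkConf, SparkContext}

      // Illustrative dynamic-allocation setup. Idle executors are reclaimed after
      // executorIdleTimeout, but executors holding cached blocks fall under
      // cachedExecutorIdleTimeout, which defaults to infinity -- so a blacklisted
      // executor with cached data is never reclaimed.
      val conf = new SparkConf()
        .setAppName("blacklist-demo")
        .set("spark.shuffle.service.enabled", "true")              // needed for dynamic allocation
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
        // Leaving the cached-executor timeout at its (infinite) default shows the
        // problem; a finite value is only a partial workaround, since it also
        // evicts healthy executors that hold cached data:
        // .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")

      val sc = new SparkContext(conf)
      {code}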

      Users may not always want to kill bad executors, so this must be configurable to some extent. At a minimum, it should be possible to enable or disable it; perhaps the executor should only be killed after it has been blacklisted a configurable N times.
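
      Since configurability is the open question, here is a minimal sketch of how such knobs might be read, assuming hypothetical configuration keys (the names and defaults below are illustrative, not a committed design):

      {code:scala}
      import org.apache.spark.SparkConf

      // Hypothetical sketch of the proposed knobs; the key names and defaults are
      // illustrative only.
      class BlacklistKillPolicy(conf: SparkConf) {
        // Whether Spark may kill an executor once it is blacklisted.
        private val killEnabled =
          conf.getBoolean("spark.blacklist.killBlacklistedExecutors", false)

        // Only kill after the executor has been blacklisted N times
        // (hypothetical threshold suggested in the description above).
        private val killAfterN =
          conf.getInt("spark.blacklist.killAfterNBlacklistings", 1)

        /** Called when an executor is (re)blacklisted. */
        def shouldKill(timesBlacklisted: Int): Boolean =
          killEnabled && timesBlacklisted >= killAfterN
      }
      {code}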


            People

              Assignee: Jose Soltren (jsoltren)
              Reporter: Imran Rashid (irashid)
              Shepherd: Imran Rashid
              Votes: 0
              Watchers: 6
