SPARK-8425 will allow blacklisting faulty executors and nodes. However, these blacklisted executors will continue to run. This is bad for a few reasons:
(1) Even if there is faulty hardware, if the cluster is under-utilized, Spark may be able to request another executor on a different node.
(2) If there is a faulty disk (the most common kind of faulty hardware), the cluster manager may be able to allocate another executor on the same node if it can exclude the bad disk. (YARN will do this with its disk-health checker.)
With dynamic allocation, this may seem less critical, since a blacklisted executor will stop running new tasks and eventually be reclaimed. However, if there is cached data on those executors, they will not be killed until spark.dynamicAllocation.cachedExecutorIdleTimeout expires, which is effectively infinite by default.
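For reference, the relevant dynamic-allocation settings in spark-defaults.conf look like this (the finite 10m value is only an illustrative override, not the default):

```properties
# Reclaim executors that sit idle...
spark.dynamicAllocation.enabled                    true
spark.dynamicAllocation.executorIdleTimeout        60s
# ...but the cached-data timeout defaults to "infinity", so a blacklisted
# executor holding cached blocks is never reclaimed. Overriding it (e.g. 10m)
# is only a partial workaround, since it also evicts cache on healthy executors.
spark.dynamicAllocation.cachedExecutorIdleTimeout  10m
```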
Users may not always want bad executors to be killed, so this must be configurable to some extent. At a minimum, it should be possible to enable or disable it; perhaps an executor should only be killed after it has been blacklisted a configurable N times.
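One possible shape for that configuration (property names here are purely illustrative sketches for this proposal, not an agreed API):

```properties
# Hypothetical settings for this proposal -- names are not final.
# Whether Spark should actively kill executors once they are blacklisted.
spark.blacklist.killBlacklistedExecutors   false
# Only kill an executor after it has been blacklisted this many times.
spark.blacklist.maxBlacklistsBeforeKill    2
```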