Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 1.6.2
- None
Description
On YARN, it's possible for a node to be broken or misconfigured such that containers won't launch on it. For instance, the Spark external shuffle handler may not have been loaded on it, or there may be some other hardware or Hadoop configuration issue.
It would be nice if we could recognize this happening and stop trying to launch executors on that node, since repeated launch failures could cause us to hit our maximum number of executor failures and then kill the job.
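As a rough illustration of the idea, the sketch below counts container launch failures per node and excludes a node from further executor requests once it crosses a threshold. This is not Spark's or YARN's API: the class name `NodeLaunchFailureTracker`, its methods, and the `max_failures_per_node` threshold are all hypothetical names chosen for this example.

```python
class NodeLaunchFailureTracker:
    """Hypothetical helper: tracks container launch failures per node."""

    def __init__(self, max_failures_per_node=2):
        # Assumed threshold; a real setting would come from configuration.
        self.max_failures_per_node = max_failures_per_node
        self._failures = {}

    def record_launch_failure(self, node):
        # Called whenever a container fails to launch on `node`.
        self._failures[node] = self._failures.get(node, 0) + 1

    def is_excluded(self, node):
        # A node is excluded once it has reached the failure threshold.
        return self._failures.get(node, 0) >= self.max_failures_per_node

    def allowed_nodes(self, candidates):
        # Keep only the nodes that are still eligible for new executors.
        return [n for n in candidates if not self.is_excluded(n)]


tracker = NodeLaunchFailureTracker(max_failures_per_node=2)
tracker.record_launch_failure("node-a")
tracker.record_launch_failure("node-a")
eligible = tracker.allowed_nodes(["node-a", "node-b"])
```

With these assumptions, `node-a` is dropped after its second failed launch while `node-b` stays eligible. A real implementation would also have to decide what to do when every node ends up excluded, which is the situation the linked SPARK-29683 describes.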
Issue Links
- causes
  - SPARK-29683 Job failed due to executor failures all available nodes are blacklisted (Resolved)
- relates to
  - SPARK-19755 Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time. (Resolved)
  - SPARK-24567 nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node (Resolved)
  - SPARK-23485 Kubernetes should support node blacklist (Reopened)