Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.1.0
- Fix Version/s: None
- Environment: Mesos, Marathon, Docker; driver and executors are dockerized.
Description
When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments the failure counter for the slave where that task was running.
When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded from further offers.
Over time the scheduler cannot create a new executor because every slave ends up on the blacklist. A task failure is not necessarily related to host health, especially for long-running streaming apps.
If accepted as a bug: a possible solution is to make that functionality optional via spark.blacklist.enabled, and, if it makes sense, MAX_SLAVE_FAILURES could also be made configurable. A sketch of the current behavior and the proposed opt-out follows.
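For illustration, here is a minimal Scala sketch of the per-slave failure accounting described above, with the proposed spark.blacklist.enabled gate applied. This is a simplified model under assumptions, not the actual MesosCoarseGrainedSchedulerBackend source; the object, map, and method names are hypothetical stand-ins.

{code:scala}
// Simplified model of per-slave failure counting; all names are
// illustrative, not the real backend internals.
object BlacklistSketch {
  // hard-coded limit in the backend today, per the description above
  val MAX_SLAVE_FAILURES = 2

  // hypothetical: honor spark.blacklist.enabled instead of always blacklisting
  val blacklistEnabled: Boolean =
    sys.props.getOrElse("spark.blacklist.enabled", "true").toBoolean

  // failure count per slave id, defaulting to zero
  private val taskFailures =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  // called whenever a task on the given slave fails, for any reason
  def onTaskFailed(slaveId: String): Unit =
    taskFailures(slaveId) += 1

  // once a slave hits the limit it is excluded from future offers;
  // over time every slave can end up here even if the hosts are healthy
  def isBlacklisted(slaveId: String): Boolean =
    blacklistEnabled && taskFailures(slaveId) >= MAX_SLAVE_FAILURES
}
{code}

With the flag set to false, a long-running streaming job would keep receiving offers from a slave even after transient task failures, which is the behavior this report argues for.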
Issue Links
- causes
  - SPARK-24567 nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node (Resolved)
- is duplicated by
  - SPARK-23423 Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors (Resolved)
- is related to
  - SPARK-16630 Blacklist a node if executors won't launch on it. (Resolved)
- relates to
  - SPARK-23485 Kubernetes should support node blacklist (Reopened)