Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.1.0
- Fix Version/s: None
- Environment: Mesos, Marathon, Docker; the driver and executors are dockerized.
Description
When a task fails for any reason, MesosCoarseGrainedSchedulerBackend increments the failure counter for the slave where that task was running.
When the counter reaches MAX_SLAVE_FAILURES (2), the Mesos slave is excluded.
Over time the scheduler cannot create new executors, because every slave ends up in the blacklist. A task failure is not necessarily related to host health, especially for long-running streaming apps.
If this is accepted as a bug, a possible solution is to use spark.blacklist.enabled to make that behavior optional and, if it makes sense, to make MAX_SLAVE_FAILURES configurable as well.
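The failure-counting behavior described above can be sketched as follows. This is a hedged, simplified illustration of the logic being reported, not the actual Spark source; the names `MAX_SLAVE_FAILURES` and `recordFailure` mirror the report but are assumptions here:

```scala
// Simplified sketch of the per-slave failure counting described in this
// issue. In MesosCoarseGrainedSchedulerBackend the constant is 2; once a
// slave crosses it, the scheduler stops launching executors there for the
// lifetime of the application.
object BlacklistSketch {
  val MAX_SLAVE_FAILURES = 2

  // How many tasks have failed on each slave so far.
  private val failuresBySlave =
    scala.collection.mutable.Map[String, Int]().withDefaultValue(0)

  def recordFailure(slaveId: String): Unit =
    failuresBySlave(slaveId) += 1

  // A blacklisted slave is never offered work again, so a long-running
  // streaming app can eventually blacklist the whole cluster.
  def isBlacklisted(slaveId: String): Boolean =
    failuresBySlave(slaveId) >= MAX_SLAVE_FAILURES
}
```

Because the counter is never reset and the threshold is hard-coded, two unrelated task failures on the same host (e.g. OOM in the app itself) permanently exclude that host, which is the crux of this report.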
Attachments
Issue Links
- causes
-
SPARK-24567 nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node
- Resolved
- is duplicated by
-
SPARK-23423 Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
- Resolved
- is related to
-
SPARK-16630 Blacklist a node if executors won't launch on it.
- Resolved
- relates to
-
SPARK-23485 Kubernetes should support node blacklist
- Reopened
I'm closing this because the configs you're proposing already exist: spark.blacklist.enabled turns off all blacklisting (it is false by default, so the fact that you're seeing blacklisting behavior means your configuration enables it), and spark.blacklist.maxFailedTaskPerExecutor covers the other setting you proposed. All of the blacklisting parameters are listed here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L101
Feel free to re-open this if I've misunderstood and the existing configs don't address the issues you're seeing!
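For reference, the settings the closing comment points to can be put in spark-defaults.conf. This is a minimal illustrative fragment; the exact parameter names vary across Spark versions, so they should be checked against the linked package.scala before use:

```
# spark-defaults.conf (sketch; verify names against your Spark version)
# Master switch for task/executor blacklisting (false by default).
spark.blacklist.enabled            true
# Failure-count threshold discussed in the closing comment (name as
# written there; confirm it exists in your version's config listing).
spark.blacklist.maxFailedTaskPerExecutor  2
```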