Details
- Improvement
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.3.0
- None
- None
Description
Currently, with executor blacklisting enabled and dynamic allocation on, an application that has only 1 active executor will fail entirely if a task failure results in that executor getting blacklisted. The spark.blacklist.killBlacklistedExecutors setting doesn't matter in this case; it can be on or off. The error you get is something like:
Aborting TaskSet 0.0 because task 9 (partition 9) cannot run anywhere due to node and executor blacklist.
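The setup described above can be approximated with a handful of Spark settings. The sketch below is only an illustration: the helper `to_submit_args` and the name `REPRO_CONF` are hypothetical, and pinning `minExecutors` and `initialExecutors` to 1 is an assumption about how to end up with a single active executor, not something stated in this issue.

```python
# Settings that approximate the scenario in this issue. The config keys are
# standard Spark properties; the specific values are assumptions.
REPRO_CONF = {
    "spark.blacklist.enabled": "true",
    "spark.dynamicAllocation.enabled": "true",
    # Dynamic allocation normally needs the external shuffle service.
    "spark.shuffle.service.enabled": "true",
    # Assumption: start with (and allow dropping to) a single executor.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.initialExecutors": "1",
}


def to_submit_args(conf):
    """Hypothetical helper: render settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(REPRO_CONF)))
```

The resulting arguments can be appended to a `spark-submit` command line. As noted above, spark.blacklist.killBlacklistedExecutors is left out because it does not change the outcome.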
This is very undesirable behavior: you may have a huge job in which one task is the long tail, and if that task happens to hit a bad node that gets blacklisted, the entire job fails.
Ideally, since dynamic allocation is on, the scheduler should not fail immediately; it should let dynamic allocation try to get more executors.
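The requested behavior can be sketched as a small decision function. This is a toy model under stated assumptions, not Spark's TaskSetManager code: the names `Decision` and `decide` are hypothetical, and a real scheduler would presumably also need a timeout so that waiting cannot turn into a hang.

```python
from enum import Enum


class Decision(Enum):
    RUN = "run"                    # some executor can still take the task
    WAIT_FOR_EXECUTORS = "wait"    # let dynamic allocation request more
    ABORT = "abort"                # nothing usable and no way to get more


def decide(active_executors, blacklisted_for_task, dynamic_allocation_enabled):
    """Hypothetical model of the proposed scheduling decision for one task."""
    usable = set(active_executors) - set(blacklisted_for_task)
    if usable:
        return Decision.RUN
    if dynamic_allocation_enabled:
        # Proposed behavior: do not abort immediately; new executors may arrive.
        return Decision.WAIT_FOR_EXECUTORS
    # Behavior reported in this issue, regardless of dynamic allocation.
    return Decision.ABORT
```

With a single executor `exec-1` that is blacklisted for the task, `decide(["exec-1"], ["exec-1"], True)` yields `Decision.WAIT_FOR_EXECUTORS` in this model, whereas the behavior reported here corresponds to `Decision.ABORT`.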
Issue Links
- is related to
  - SPARK-22148 TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled (Resolved)
  - SPARK-15815 Hang while enable blacklistExecutor and DynamicExecutorAllocator (Resolved)