[SPARK-22148] TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.4.1, 3.0.0
Component/s: Scheduler, Spark Core
Labels:
None

Description

Currently, TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and the whole Spark job with `task X (partition Y) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.` when all the available executors are blacklisted for a pending Task or TaskSet. This makes sense for static allocation, where the set of executors is fixed for the duration of the application, but it can lead to unnecessary job failures when dynamic allocation is enabled.

For example, in a Spark application that runs a single job at a time, when a node fails at the end of a stage attempt, all other executors will complete their tasks, but the tasks running in the executors of the failing node will remain pending. Spark will keep waiting for those tasks for 2 minutes by default (spark.network.timeout) until the heartbeat timeout is triggered, and then it will blacklist those executors for that stage. By that point, the other executors will already have been released after being idle for 1 minute by default (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't started yet and so there are no more tasks available (assuming the default of spark.speculation = false). So Spark will fail because the only executors available are blacklisted for that stage.
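The timing in this scenario can be illustrated with a small stand-alone model. This is only a sketch in plain Python, not Spark code: the `Executor` class, the function name, and the executor names are hypothetical, and the two timeouts simply mirror the defaults quoted above (120 s for spark.network.timeout, 60 s for spark.dynamicAllocation.executorIdleTimeout).

```python
from dataclasses import dataclass


@dataclass
class Executor:
    name: str
    on_failed_node: bool
    # Seconds after the node failure at which this executor ran out of tasks.
    idle_since: float


def usable_executors_at_blacklist_time(executors, network_timeout=120.0, idle_timeout=60.0):
    """Return names of executors still registered and not blacklisted
    at the moment the heartbeat timeout fires (hypothetical model)."""
    usable = []
    for ex in executors:
        if ex.on_failed_node:
            # Blacklisted for the stage once the heartbeat timeout fires.
            continue
        released_at = ex.idle_since + idle_timeout
        if released_at <= network_timeout:
            # Already released by dynamic allocation for being idle.
            continue
        usable.append(ex.name)
    return usable


executors = [
    Executor("exec-1", on_failed_node=True, idle_since=0.0),
    Executor("exec-2", on_failed_node=False, idle_since=5.0),
    Executor("exec-3", on_failed_node=False, idle_since=10.0),
]

remaining = usable_executors_at_blacklist_time(executors)
print(remaining)  # [] : nothing usable is left, so the TaskSet would be aborted
```

With the default values the healthy executors are released at 65 s and 70 s, well before the 120 s heartbeat timeout, which is why only blacklisted executors remain. In this model, an idle timeout longer than the network timeout keeps the healthy executors around.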

An alternative is to request more executors from the cluster manager in this situation. The request could be retried a configurable number of times, with a configurable wait time between attempts, so that if the cluster manager fails to provide a suitable executor the job is aborted as in the previous case.
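The proposed behaviour could look roughly like the following. This is a hedged sketch in plain Python rather than the actual patch: `acquire_unblacklisted_executor`, `request_executor`, `is_blacklisted`, `max_attempts`, and `wait_seconds` are all hypothetical names chosen for illustration and do not correspond to Spark APIs or configuration keys.

```python
import time


def acquire_unblacklisted_executor(request_executor, is_blacklisted,
                                   max_attempts=3, wait_seconds=0.0,
                                   sleep=time.sleep):
    """Ask for a new executor up to max_attempts times.

    request_executor: callable returning an executor id, or None if the
        cluster manager provided nothing (hypothetical stand-in).
    is_blacklisted: callable telling whether an executor id is blacklisted
        for the pending task.
    Returns a usable executor id, or None if the caller should abort.
    """
    for attempt in range(1, max_attempts + 1):
        executor = request_executor()
        if executor is not None and not is_blacklisted(executor):
            return executor
        if attempt < max_attempts:
            # Wait before the next request attempt.
            sleep(wait_seconds)
    return None


# Usage: the first request yields nothing, the second a blacklisted
# executor, and the third a suitable one.
offers = iter([None, "exec-1", "exec-9"])
blacklist = {"exec-1"}
chosen = acquire_unblacklisted_executor(
    lambda: next(offers, None), lambda e: e in blacklist, max_attempts=3)
print(chosen)  # exec-9
```

If every attempt fails, the function returns `None`, and the caller would abort the TaskSet exactly as it does today.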

Attachments

SPARK-22148_WIP.diff (8 kB), attached 24/Oct/17 17:16 by Juan Rodríguez Hortalá

Issue Links

duplicates

SPARK-15815 Hang while enable blacklistExecutor and DynamicExecutorAllocator

Resolved

is related to

SPARK-31418 Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

Resolved

relates to

SPARK-15815 Hang while enable blacklistExecutor and DynamicExecutorAllocator

Resolved

SPARK-24413 Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors

Resolved

links to

[Github] Pull Request #19590 (juanrh)

[Github] Pull Request #22288 (dhruve)


People

Assignee: Dhruve Ashar

Reporter: Juan Rodríguez Hortalá

Votes: 1

Watchers: 11

Dates

Created: 27/Sep/17 17:28

Updated: 13/Apr/20 23:22

Resolved: 06/Nov/18 14:26