Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Spark's current task cancellation / task killing mechanism is "best effort" in the sense that some tasks may not be interruptible and may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running, then new jobs and tasks can be starved of resources by these zombie tasks.
I propose to address this problem by introducing a "task reaper" mechanism in executors that monitors tasks after they are marked for killing. The reaper would periodically re-attempt the task kill, capture and log stack traces / warnings if tasks do not exit in a timely manner, and, optionally, kill the entire executor JVM if cancelled tasks cannot be killed within some timeout (a rough sketch follows below).
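To make the idea concrete, here is a minimal, hypothetical sketch of such a reaper: a monitor thread that repeatedly re-interrupts the killed task, logs its stack trace while it remains alive, and optionally halts the executor JVM once a timeout elapses. This is not Spark's actual implementation; the class name TaskReaper and the parameters pollingIntervalMs, killTimeoutMs, and killJvmOnTimeout are illustrative assumptions, not the APIs or configuration keys that eventually shipped.

```scala
// Hypothetical sketch of a "task reaper" for a killed-but-still-running task.
class TaskReaper(
    taskThread: Thread,
    taskId: Long,
    pollingIntervalMs: Long,
    killTimeoutMs: Long,
    killJvmOnTimeout: Boolean) extends Runnable {

  override def run(): Unit = {
    val startTimeMs = System.currentTimeMillis()
    while (taskThread.isAlive) {
      // Re-attempt the kill in case the first interrupt was swallowed or ignored.
      taskThread.interrupt()

      // Log where the unresponsive task is stuck so the zombie can be diagnosed.
      val stack = taskThread.getStackTrace.mkString("\n\tat ")
      System.err.println(s"Task $taskId is still running after kill request:\n\tat $stack")

      val elapsedMs = System.currentTimeMillis() - startTimeMs
      if (killJvmOnTimeout && elapsedMs > killTimeoutMs) {
        // Last resort: give up the executor JVM rather than let the zombie task
        // hold its task slot indefinitely.
        System.err.println(s"Task $taskId did not exit within ${killTimeoutMs}ms; halting executor JVM.")
        Runtime.getRuntime.halt(1)
      }
      Thread.sleep(pollingIntervalMs)
    }
  }
}

object TaskReaperExample {
  def main(args: Array[String]): Unit = {
    // A busy-looping thread that never checks its interrupt flag,
    // simulating an uninterruptible zombie task.
    val zombieTask = new Thread(new Runnable {
      override def run(): Unit = { var i = 0L; while (true) { i += 1 } }
    })
    zombieTask.setDaemon(true)
    zombieTask.start()

    val reaper = new Thread(new TaskReaper(zombieTask, taskId = 42L,
      pollingIntervalMs = 1000L, killTimeoutMs = 5000L, killJvmOnTimeout = false))
    reaper.setDaemon(true)
    reaper.start()

    // Let the reaper poll a few times, then exit (both threads are daemons).
    Thread.sleep(3500)
  }
}
```

Halting the JVM is deliberately framed as a last resort in this sketch: it sacrifices any other tasks running on that executor, but it guarantees that the zombie task's slots are returned to the cluster.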
Attachments
Issue Links
- relates to: SPARK-17064 Reconsider spark.job.interruptOnCancel (Closed)
- links to