Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Spark's current task cancellation / task killing mechanism is "best effort" in the sense that some tasks may not be interruptible and may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running, then new jobs and tasks can be starved of resources by these zombie tasks.
I propose to address this problem by introducing a "task reaper" mechanism in executors that monitors tasks after they are marked for killing. The reaper would periodically re-attempt the task kill, capture and log stack traces / warnings if tasks do not exit in a timely manner, and, optionally, kill the entire executor JVM if cancelled tasks cannot be killed within some timeout (a rough sketch follows below).
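To make the idea concrete, here is a minimal, hypothetical sketch of such a reaper: a monitor thread that repeatedly re-interrupts the killed task, logs its stack trace while it remains alive, and optionally halts the executor JVM once a timeout elapses. This is not Spark's actual implementation; the class name TaskReaper and the parameters pollingIntervalMs, killTimeoutMs, and killJvmOnTimeout are illustrative assumptions, not the APIs or configuration keys that eventually shipped.

```scala
// Hypothetical sketch of a "task reaper" for a killed-but-still-running task.
class TaskReaper(
    taskThread: Thread,
    taskId: Long,
    pollingIntervalMs: Long,
    killTimeoutMs: Long,
    killJvmOnTimeout: Boolean) extends Runnable {

  override def run(): Unit = {
    val startTimeMs = System.currentTimeMillis()
    while (taskThread.isAlive) {
      // Re-attempt the kill in case the first interrupt was swallowed or ignored.
      taskThread.interrupt()

      // Log where the unresponsive task is stuck so the zombie can be diagnosed.
      val stack = taskThread.getStackTrace.mkString("\n\tat ")
      System.err.println(s"Task $taskId is still running after kill request:\n\tat $stack")

      val elapsedMs = System.currentTimeMillis() - startTimeMs
      if (killJvmOnTimeout && elapsedMs > killTimeoutMs) {
        // Last resort: give up the executor JVM rather than let the zombie task
        // hold its task slot indefinitely.
        System.err.println(s"Task $taskId did not exit within ${killTimeoutMs}ms; halting executor JVM.")
        Runtime.getRuntime.halt(1)
      }
      Thread.sleep(pollingIntervalMs)
    }
  }
}

object TaskReaperExample {
  def main(args: Array[String]): Unit = {
    // A busy-looping thread that never checks its interrupt flag,
    // simulating an uninterruptible zombie task.
    val zombieTask = new Thread(new Runnable {
      override def run(): Unit = { var i = 0L; while (true) { i += 1 } }
    })
    zombieTask.setDaemon(true)
    zombieTask.start()

    val reaper = new Thread(new TaskReaper(zombieTask, taskId = 42L,
      pollingIntervalMs = 1000L, killTimeoutMs = 5000L, killJvmOnTimeout = false))
    reaper.setDaemon(true)
    reaper.start()

    // Let the reaper poll a few times, then exit (both threads are daemons).
    Thread.sleep(3500)
  }
}
```

Halting the JVM is deliberately framed as a last resort in this sketch: it sacrifices any other tasks running on that executor, but it guarantees that the zombie task's slots are returned to the cluster.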
Attachments
Issue Links
- relates to: SPARK-17064 Reconsider spark.job.interruptOnCancel (Closed)
- links to