SPARK-18761

Uncancellable / unkillable tasks may starve jobs of resources

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.3, 2.1.1, 2.2.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Spark's current task cancellation / task killing mechanism is "best effort" in the sense that some tasks may not be interruptible and may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running, new jobs and tasks can be starved of resources because these zombie tasks still hold them.

      I propose to address this problem by introducing a "task reaper" mechanism in executors to monitor tasks after they are marked for killing in order to periodically re-attempt the task kill, capture and log stacktraces / warnings if tasks do not exit in a timely manner, and, optionally, kill the entire executor JVM if cancelled tasks cannot be killed within some timeout.
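      The proposed monitoring loop can be illustrated with a minimal, self-contained sketch. It is a Python simulation under stated assumptions, not Spark's executor code: the `Task` class, the `reap` function, and the `polling_interval`, `kill_timeout` and `on_timeout` parameters are hypothetical names chosen for illustration, and the "kill the JVM" step is replaced by a callback so the example is safe to run.

```python
import sys
import threading
import time
import traceback


class Task:
    """Hypothetical stand-in for a running task; not Spark's task runner."""

    def __init__(self, name, honors_kill):
        self.name = name
        self.honors_kill = honors_kill
        self.killed = threading.Event()         # the "killed" flag
        self.stop_for_demo = threading.Event()  # lets the demo end a zombie
        self.thread = threading.Thread(target=self._run, name=name, daemon=True)

    def _run(self):
        while not self.stop_for_demo.is_set():
            # A cooperative task checks its flag; a zombie never does.
            if self.honors_kill and self.killed.is_set():
                return
            time.sleep(0.01)

    def kill(self):
        self.killed.set()


def reap(task, polling_interval, kill_timeout, on_timeout, log=print):
    """Monitor a task after it has been marked for killing.

    Returns True if the task exited, False if the timeout elapsed first.
    """
    deadline = time.monotonic() + kill_timeout
    while True:
        task.kill()                          # periodically re-attempt the kill
        task.thread.join(polling_interval)   # wait one polling interval
        if not task.thread.is_alive():
            return True
        # Capture and log where the stuck task currently is.
        frame = sys._current_frames().get(task.thread.ident)
        stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
        log(f"task {task.name} still running after kill:\n{stack}")
        if time.monotonic() >= deadline:
            on_timeout(task)                 # optional escalation step
            return False


def demo():
    escalated = []
    results = {}
    for name, honors in (("cooperative", True), ("zombie", False)):
        task = Task(name, honors)
        task.thread.start()
        results[name] = reap(task, 0.05, 0.2, escalated.append,
                             log=lambda msg: None)
        task.stop_for_demo.set()             # clean up the zombie thread
    return results, [t.name for t in escalated]
```

      In this sketch a cooperative task is reaped on the first poll, while a task that ignores its flag is logged on every poll and handed to `on_timeout` once the timeout expires. In a real executor that callback would be the place to decide whether to terminate the whole process.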

              People

              • Assignee:
                joshrosen Josh Rosen
                Reporter:
                joshrosen Josh Rosen
              • Votes:
                0
                Watchers:
                7