  1. Spark
  2. SPARK-2666

Always try to cancel running tasks when a stage is marked as zombie


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core

    Description

      There are some situations in which the scheduler can mark a task set as a "zombie" before the task set has completed all of its tasks. For example:

      (a) When a task fails b/c of a FetchFailed
      (b) When a stage completes because two different attempts create all the ShuffleMapOutput, though no attempt has completed all its tasks (at least, this should result in the task set being marked as zombie, see SPARK-10370)

      (there may be others, I'm not sure if this list is exhaustive.)

      Marking a taskset as zombie prevents any additional tasks from getting scheduled; however, it does not cancel the tasks that are currently running. We should cancel all running tasks to avoid wasting resources (and also to make the behavior a little clearer to the end user). Rather than canceling tasks in each case piecemeal, we should refactor the scheduler so that these two actions are always taken together: canceling tasks should go hand-in-hand with marking the taskset as zombie.

      Some implementation notes:

      • We should change taskSetManager.isZombie to be private and put it behind a method like markZombie or something.
      • Marking a stage as zombie before all of its tasks have completed does not necessarily mean the stage attempt has failed. In case (a), the stage attempt has failed, but in case (b) we are not canceling b/c of a failure, rather just b/c no more tasks are needed.
      • taskScheduler.cancelTasks always marks the task set as zombie. However, it also has some side effects, like logging that the stage has failed and creating a TaskSetFailed event, which we don't want, e.g., in case (b) when nothing has failed. So it may need some additional refactoring to go along w/ markZombie.
      • SchedulerBackends are free to not implement killTask, so we need to be sure to catch the resulting UnsupportedOperationException.
      • Testing this might benefit from SPARK-10372
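
      The notes above could be sketched roughly as follows. This is only an illustration, not Spark's actual scheduler code: SketchTaskSetManager, Backend, ZombieReason, and the simplified two-argument killTask are hypothetical stand-ins, and the events buffer stands in for logging and for events such as TaskSetFailed. It assumes the zombie flag is private, that markZombie is the only way to set it, and that a failure is reported only in case (a).

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; these are NOT Spark's real classes.
sealed trait ZombieReason
case object AttemptFailed extends ZombieReason      // case (a): e.g. a FetchFailed
case object NoMoreTasksNeeded extends ZombieReason  // case (b): outputs already exist

trait Backend {
  // A backend that cannot kill tasks may throw UnsupportedOperationException.
  def killTask(taskId: Long, interruptThread: Boolean): Unit
}

class SketchTaskSetManager(backend: Backend) {
  private var zombie = false                     // only markZombie may set this
  private val running = mutable.Set.empty[Long]  // ids of currently running tasks
  val events = mutable.Buffer.empty[String]      // stand-in for logging and events

  def isZombie: Boolean = zombie
  def runningTasks: Set[Long] = running.toSet

  // A zombie task set never starts new tasks.
  def taskStarted(taskId: Long): Unit = if (!zombie) running += taskId
  // Called when a task finishes or is confirmed killed.
  def taskEnded(taskId: Long): Unit = running -= taskId

  /** Marks the task set as zombie and tries to cancel every running task. */
  def markZombie(reason: ZombieReason): Unit = {
    if (!zombie) {
      zombie = true
      // Report a failure only when the attempt actually failed.
      if (reason == AttemptFailed) events += "TaskSetFailed"
      running.toSeq.foreach { id =>
        try backend.killTask(id, interruptThread = false)
        catch {
          case _: UnsupportedOperationException =>
            // The backend cannot kill tasks; record it and keep going.
            events += s"could not kill task $id"
        }
      }
    }
  }
}
```

      In this sketch, a caller handling case (b) would invoke markZombie(NoMoreTasksNeeded), which cancels the running tasks without emitting the failure event, while the FetchFailed path would pass AttemptFailed. Tasks stay in the running set until taskEnded is called, on the assumption that a kill is only a request and the backend reports completion separately.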


          People

            Assignee: Unassigned
            Reporter: Lianhui Wang
            Votes: 2
            Watchers: 14
