Spark / SPARK-14915

Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.3, 1.6.2, 2.0.0
    • Fix Version/s: 1.6.2, 2.0.0
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      In SPARK-14357, the code was corrected toward the originally intended behavior: a CommitDeniedException (CDE) should not count towards the failure count for a job. After running with this fix for a few weeks, it's become apparent that the behavior has an unintended consequence - a speculative task can continuously receive a CDE from the driver, causing it to fail and be resubmitted over and over without limit.

      I'm thinking we could put a task that receives a CDE from the driver into TaskState.FINISHED, or some other state that indicates the task shouldn't be resubmitted by the TaskScheduler. I'd appreciate opinions on whether doing something like this has other consequences.
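      The failure loop described above can be sketched with a toy simulation. This is not Spark code: `OutputCommitCoordinator`, `run_attempt`, and `schedule` below are simplified, hypothetical stand-ins for Spark's commit coordination and task resubmission logic, used only to show why not counting a CDE toward `max_failures` lets a denied speculative attempt loop forever.

      ```python
      class CommitDeniedException(Exception):
          """Raised when the driver refuses an attempt's request to commit."""


      class OutputCommitCoordinator:
          """Grants commit rights to exactly one attempt per partition.

          Toy model: the first attempt to ask wins; all later (speculative)
          attempts are denied, mirroring the behavior in the description.
          """

          def __init__(self):
              self.winner = {}  # partition -> attempt authorized to commit

          def can_commit(self, partition, attempt):
              if partition not in self.winner:
                  self.winner[partition] = attempt
              return self.winner[partition] == attempt


      def run_attempt(coordinator, partition, attempt):
          """One task attempt: commit if authorized, else raise a CDE."""
          if not coordinator.can_commit(partition, attempt):
              raise CommitDeniedException()
          return "committed"


      def schedule(partition, attempts, count_cde_as_failure,
                   max_failures=4, max_steps=20):
          """Resubmit denied attempts; return how many launches occurred.

          max_steps caps the simulation so the buggy case terminates.
          """
          coordinator = OutputCommitCoordinator()
          failures = 0
          launches = 0
          queue = list(attempts)
          while queue and failures < max_failures and launches < max_steps:
              attempt = queue.pop(0)
              launches += 1
              try:
                  run_attempt(coordinator, partition, attempt)
              except CommitDeniedException:
                  if count_cde_as_failure:
                      failures += 1
                  # The denied speculative attempt is resubmitted, is denied
                  # again, and - if the CDE never counts as a failure -
                  # loops without ever reaching max_failures.
                  queue.append(attempt)
          return launches
      ```

      With `count_cde_as_failure=True` the losing speculative attempt exhausts `max_failures` and the loop stops; with `False` (the post-SPARK-14357 behavior) only the artificial `max_steps` cap ends the simulation, which is the runaway resubmission this issue describes.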

        Attachments

          Issue Links

            Activity

              People

              • Assignee: jasonmoore2k Jason Moore
              • Reporter: jasonmoore2k Jason Moore
              • Votes: 0
              • Watchers: 4

                Dates

                • Created:
                • Updated:
                • Resolved: