Spark / SPARK-8167

Tasks that fail due to YARN preemption can cause job failure


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.6.0
    • Component/s: Scheduler, Spark Core, YARN
    • Labels:
      None
      Description

      Tasks that are running on preempted executors are counted as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control during a large resource shift, when tasks are rescheduled onto executors that are immediately preempted as well.

      The current workaround is to increase spark.task.maxFailures to a very high value, but that can delay the detection of genuine failures. Ideally, we should differentiate these task statuses so that preemption-related failures don't count towards the failure limit.
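The workaround above is a submit-time configuration change. A minimal sketch, assuming a YARN deployment; the value 100 and the application class/jar names are illustrative, not recommendations:

```shell
# Raise the per-task failure tolerance so tasks killed by YARN preemption
# do not exhaust the retry budget as quickly (Spark's default is 4).
# com.example.MyApp and my-app.jar are hypothetical placeholders.
spark-submit \
  --master yarn \
  --conf spark.task.maxFailures=100 \
  --class com.example.MyApp \
  my-app.jar
```

The trade-off described in the issue applies: with a limit this high, a task that fails for a genuine, deterministic reason will be retried many times before the job is finally failed.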

            People

            • Assignee:
              mcheah Matt Cheah
            • Reporter:
              pwoody Patrick Woody
            • Votes:
              1
            • Watchers:
              18
