SPARK-21219

Task retry occurs on same executor due to race condition with blacklisting


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      When a task fails, it is (1) added back into the pending task list and then (2) the corresponding blacklist policy is enforced (i.e., recording whether it can or cannot run on a particular node/executor/etc.). Unfortunately, with this ordering the retried task can be assigned to the same executor, which, incidentally, may be shutting down and will immediately fail the retry. Instead, the order should be: (1) update the blacklist state and then (2) assign the task, ensuring that the blacklist policy is properly enforced.
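
      To make the ordering concrete, here is a minimal, self-contained Scala sketch of the two orderings. All names (BlacklistOrderingSketch, pendingTasks, failedOn, offer, etc.) are illustrative stand-ins, not the actual TaskSetManager or blacklist internals:

      import scala.collection.mutable

      // Minimal sketch of a per-task-set blacklist keyed by (taskIndex, executorId).
      // All names here are hypothetical; this is not the real scheduler code.
      object BlacklistOrderingSketch {
        val pendingTasks = mutable.Queue[Int]()          // tasks waiting to be (re)scheduled
        val failedOn     = mutable.Set[(Int, String)]()  // (taskIndex, executorId) failures seen so far

        // Buggy ordering described in this issue: the task is re-queued (1)
        // before the failure is recorded (2), so an offer from the same,
        // possibly dying, executor can grab the retry in between.
        def handleFailedTaskBuggy(taskIndex: Int, execId: String): Unit = {
          pendingTasks.enqueue(taskIndex)    // (1) task becomes schedulable again
          failedOn += ((taskIndex, execId))  // (2) blacklist updated too late
        }

        // Fixed ordering: record the failure first, then re-queue, so any
        // later offer from the failed executor is rejected.
        def handleFailedTaskFixed(taskIndex: Int, execId: String): Unit = {
          failedOn += ((taskIndex, execId))
          pendingTasks.enqueue(taskIndex)
        }

        // Offer-side check: hand out the first pending task that has not
        // already failed on the offering executor.
        def offer(execId: String): Option[Int] =
          pendingTasks.dequeueFirst(idx => !failedOn.contains((idx, execId)))
      }

      With the buggy ordering, an offer that arrives between steps (1) and (2) sees no blacklist entry for the failed executor and hands the retry right back to it.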

      The attached logs demonstrate the race condition.

      See spark_executor.log.anon:

      1. Task 55.2 fails on the executor

      17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
      java.lang.OutOfMemoryError: Java heap space

      2. Immediately the same executor is assigned the retry task:

      17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
      17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

      3. The retry task of course fails, since the executor is shutting down due to the original task 55.2 OOM failure; the sketch-based replay below walks through this interleaving.
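
      Replaying this sequence against the hypothetical sketch from the description (again with made-up names and executor ids) shows how the buggy ordering lets the offer win the race, while the fixed ordering rejects it:

      // Illustrative replay of the race; depends on the BlacklistOrderingSketch
      // object defined above, and on hypothetical names throughout.
      object RaceDemo extends App {
        import BlacklistOrderingSketch._

        // Buggy ordering: the offer from the failing executor arrives between
        // re-queue (1) and blacklist update (2), mirroring the executor log.
        pendingTasks.enqueue(55)                 // (1) task 55 re-queued after its OOM
        println(offer("dying-executor"))         // Some(55): retry lands on the same executor
        failedOn += ((55, "dying-executor"))     // (2) failure recorded too late

        // Fixed ordering: the failure is recorded before the task becomes
        // schedulable, so the same offer now comes back empty.
        pendingTasks.clear(); failedOn.clear()
        handleFailedTaskFixed(55, "dying-executor")
        println(offer("dying-executor"))         // None: retry must wait for a different executor
      }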

      See spark_driver.log.anon:

      The driver processes the lost task 55.2:

      17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################_0): java.lang.OutOfMemoryError: Java heap space

      The driver then receives the ExecutorLostFailure for the retry task 55.3 (although obfuscated in these logs, the server info is the same):

      17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar####.masked-server.com, executor attempt_foobar####.masked-server.com-################.masked-server.com-################0): ExecutorLostFailure (executor attempt_foobar####.masked-server.com-################.masked-server.com-############_####_0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
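
      For reference, the blacklist behavior that this ordering is meant to honor is driven by the spark.blacklist.* configuration keys in the affected releases. The sketch below uses key names from the Spark configuration docs; defaults and exact semantics vary by version, so treat it as illustrative:

      import org.apache.spark.SparkConf

      object BlacklistConfSketch {
        // Configuration sketch: spark.blacklist.* keys as documented for Spark 2.1+.
        // Exact defaults and semantics vary by release; check the docs for your version.
        val conf: SparkConf = new SparkConf()
          .setAppName("blacklist-enabled-app")
          .set("spark.blacklist.enabled", "true")                      // blacklisting is off by default
          .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // retries must go to a different executor
          .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")     // and, after enough failures, to a different node
      }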

      Attachments

        1. spark_executor.log.anon (8 kB, Eric Vandenberg)
        2. spark_driver.log.anon (4 kB, Eric Vandenberg)


          People

            Assignee: Eric Vandenberg (ericvandenbergfb)
            Reporter: Eric Vandenberg (ericvandenbergfb)
            Votes: 0
            Watchers: 5
