Spark / SPARK-18113

Sending AskPermissionToCommitOutput fails; driver enters an infinite task retry loop


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.2.0
    • Component/s: Scheduler, Spark Core
    • Labels: None
    • Environment:
      cat /etc/redhat-release
      Red Hat Enterprise Linux Server release 7.2 (Maipo)

    Description

  The executor's AskPermissionToCommitOutput request to the driver times out, so the RPC layer retries and sends it again. The driver receives two AskPermissionToCommitOutput messages and handles both, but the executor misses the first response (true) and only sees the second (false). The TaskAttemptNumber for this partition stays locked in authorizedCommittersByStage forever, so every subsequent attempt is denied and the driver enters an infinite retry loop.
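  The race above can be sketched with a toy, non-Spark model (hypothetical class and method names; the real driver-side state lives in OutputCommitCoordinator). The point is that a non-idempotent commit check wedges the partition once the retried copy of the same ask arrives:

```scala
import scala.collection.mutable

// Toy model (hypothetical names, not Spark's actual code) of the driver-side
// commit-authorization state described in this report.
final class CommitCoordinatorSketch {
  // partition -> attempt number currently authorized to commit
  private val authorizedCommitters = mutable.Map.empty[Int, Int]

  // Non-idempotent check: once any attempt holds the slot, every later ask
  // is denied -- including a retried copy of the very same request. The
  // first reply (true) is lost to the RPC timeout, the executor only sees
  // the second reply (false), so attempt 0 never commits, the slot is
  // never released, and every retry attempt is denied as well.
  def canCommit(partition: Int, attempt: Int): Boolean =
    if (authorizedCommitters.contains(partition)) {
      false
    } else {
      authorizedCommitters(partition) = attempt
      true
    }
}

object CommitCoordinatorSketch {
  def main(args: Array[String]): Unit = {
    val coord = new CommitCoordinatorSketch
    println(coord.canCommit(24, 0)) // first copy of the ask: authorized
    println(coord.canCommit(24, 0)) // retried copy of the same ask: denied
    println(coord.canCommit(24, 1)) // re-run attempt 1: denied forever
  }
}
```

  Making the check idempotent for the same (partition, attempt) pair, i.e. repeating the ask returns the earlier answer instead of a denial, would break this loop; whether that is exactly the fix shipped in 2.2.0 is not stated in this report.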

      Driver Log:

      16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
      ...
      16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0
      ...
      16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 0
      ...
      16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
      ...
      16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1
      16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 1
      ...
      16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
      ...
      

      Executor Log:

      ...
      16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
      ...
      16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)] in 1 attempts
      org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.rpc.askTimeout
              at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
              at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
              at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
              at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
              at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
              at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
              at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
              at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
              at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
              at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
              at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
              at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
              at org.apache.spark.scheduler.Task.run(Task.scala:86)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.lang.Thread.run(Thread.java:785)
      Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:190)
              at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
              ... 13 more
      ...
      16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
      ...
      16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: attempt_201610250536_0002_m_000024_119: Not committed because the driver did not authorize commit
      ...
      

    People

      Assignee: Jin Xing (jinxing6042@126.com)
      Reporter: xuqing (xq2005)
      Votes: 0
      Watchers: 7
