Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.0.1
-
None
-
- cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
- cat /etc/redhat-release
Description
Executor sends AskPermissionToCommitOutput to driver failed, and retry another sending. Driver receives 2 AskPermissionToCommitOutput messages and handles them. But executor ignores the first response(true) and receives the second response(false). The TaskAttemptNumber for this partition in authorizedCommittersByStage is locked forever. Driver enters into infinite loop.
Driver Log:
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ... 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0 ... 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 0 ... 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ... 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 1 ... 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ...
Executor Log:
... 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) ... 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:785) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 13 more ... 16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119) ... 16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: attempt_201610250536_0002_m_000024_119: Not committed because the driver did not authorize commit ...
Attachments
Issue Links
- relates to
-
SPARK-8029 ShuffleMapTasks must be robust to concurrent attempts on the same executor
- Resolved
- links to