Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: None
Description
In our production Spark cluster, we encountered a case where a task of a stage retried due to FetchFailure was denied to commit, even though it was the first task attempt of that retry stage.
After careful investigation, we found that OutputCommitCoordinator.canCommit allows a task of the FetchFailure (failed) stage attempt, with the same partition number as the new task of the retry stage, to commit. This results in TaskCommitDenied for every task of that partition in the retry stage. Because TaskCommitDenied does not count towards task failures, this can cause the application to hang forever. A sketch of the race is given after the log excerpt below.
2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456, partition 138, PROCESS_LOCAL, 5829 bytes)
2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, attemptNumber: 1
2019-01-09,08:45:57,373 INFO org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied committing, stage: 5, partition: 138, attempt number: 0, attempt number(counting failed stage): 1
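The following is a minimal Scala sketch (not the actual OutputCommitCoordinator code) of how a coordinator that authorizes committers only by (stageId, partitionId), ignoring the stage attempt, reproduces the denial seen above. The object, method names, and values are illustrative assumptions, chosen to mirror the log excerpt.
{code:scala}
// Minimal sketch, assuming commit authorization is keyed only by
// (stageId, partitionId) and the stage attempt is not tracked.
object CommitCoordinatorSketch {
  // authorized committer per (stageId, partitionId)
  private val authorized = scala.collection.mutable.Map[(Int, Int), Int]()

  // First task attempt to ask becomes the authorized committer for the
  // partition; everyone else, including tasks of a newer stage attempt,
  // is denied.
  def canCommit(stageId: Int, partitionId: Int, taskAttemptNumber: Int): Boolean =
    synchronized {
      authorized.get((stageId, partitionId)) match {
        case None =>
          authorized((stageId, partitionId)) = taskAttemptNumber
          true
        case Some(winner) => winner == taskAttemptNumber
      }
    }

  def main(args: Array[String]): Unit = {
    // A straggler from the FetchFailure stage attempt (stage 5.0) commits first...
    println(canCommit(stageId = 5, partitionId = 138, taskAttemptNumber = 0)) // true
    // ...so task 138.0 of the retry attempt (stage 5.1) gets TaskCommitDenied,
    // and since the denial does not count towards task failures, the retry
    // stage can keep rescheduling the same partition indefinitely.
    println(canCommit(stageId = 5, partitionId = 138, taskAttemptNumber = 1)) // false
  }
}
{code}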
Attachments
Issue Links
duplicates
SPARK-25250 Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times (Resolved)