  1. Spark
  2. SPARK-26634

OutputCommitCoordinator may allow task of FetchFailureStage commit again

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      In our production Spark cluster, we encountered a case where a task in a stage that was retried due to FetchFailure was denied permission to commit, even though the task was the first attempt in that retry stage.

      After careful investigation, we found that OutputCommitCoordinator.canCommit allows a task from the original stage attempt that hit the FetchFailure (with the same partition number as the new task in the retry stage) to commit. As a result, every task for that partition in the retry stage fails with TaskCommitDenied. Because TaskCommitDenied does not count towards task failures (its countTowardsTaskFailures is false), the task is retried without limit and the application may hang forever.

       

      2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456, partition 138, PROCESS_LOCAL, 5829 bytes)
      2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
      2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, attemptNumber: 1
      2019-01-09,08:45:57,373 INFO org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied committing, stage: 5, partition: 138, attempt number: 0, attempt number(counting failed stage): 1
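
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual code: ToyCommitCoordinator, can_commit, run_retry_stage and max_launches are hypothetical names, and the model assumes that commit authorization is tracked per (stage ID, partition) without distinguishing stage attempts. The default max_failures=4 mirrors the default of spark.task.maxFailures. The max_launches cap exists only so that the sketch terminates.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCommitCoordinator:
    """Toy model: one authorized committer per (stage_id, partition).

    Assumption: the stage attempt is NOT part of the key, so tasks from
    stage 5.0 and stage 5.1 compete for the same slot.
    """
    authorized: dict = field(default_factory=dict)

    def can_commit(self, stage_id, stage_attempt, partition, task_attempt):
        key = (stage_id, partition)  # stage_attempt deliberately ignored
        requester = (stage_attempt, task_attempt)
        # The first requester becomes the holder; later ones are compared to it.
        holder = self.authorized.setdefault(key, requester)
        return holder == requester


def run_retry_stage(coordinator, max_failures=4, max_launches=10):
    """Relaunch partition 138 of stage 5.1 until it commits or a limit is hit.

    Returns (committed, counted_failures, launches).
    """
    counted_failures = 0
    launches = 0
    while counted_failures < max_failures and launches < max_launches:
        granted = coordinator.can_commit(5, 1, 138, task_attempt=launches)
        launches += 1
        if granted:
            return True, counted_failures, launches
        # Denied commit: counted_failures is intentionally left unchanged,
        # so the max_failures limit can never stop the loop.
    return False, counted_failures, launches


coordinator = ToyCommitCoordinator()
# The straggler from the original attempt (stage 5.0) asks first and wins.
first = coordinator.can_commit(5, 0, 138, task_attempt=0)
result = run_retry_stage(coordinator)
print(first, result)
```

In this model the straggler from stage 5.0 holds the slot for partition 138, so every launch in stage 5.1 is denied and the counted failures stay at zero. Only the artificial launch cap ends the loop, which corresponds to the hang reported here.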
      

            People

              Assignee: Unassigned
              Reporter: liupengcheng