  1. Spark
  2. SPARK-40082

DAGScheduler may not schedule new stages when push-based shuffle is enabled

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.5.0
    • Component/s: Scheduler
    • Labels: None

    Description

      When push-based shuffle is enabled and speculative tasks exist, a shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are resubmitted first, and recomputing them takes some time. Before the shuffleMapStage itself is resubmitted, all of its speculative tasks succeed and register their map output, but these speculative task success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages.

      The stage is then resubmitted, but because the speculative tasks have already registered map output, there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. As a result, this stage's _shuffleMergedFinalized stays false.

      AQE then submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages, this stage is marked as missing and is resubmitted, but the next stages are added to waitingStages only after this stage has already finished. The next stages are therefore never submitted, even though the stage's resubmission has completed.
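
      The sequence described above can be replayed with a small toy model. This is only a sketch of the reported ordering, not Spark's DAGScheduler code: the names Stage, ToyScheduler, replay, is_available and merge_finalized are hypothetical, and the model assumes that a shuffleMapStage counts as available only when all map outputs are registered and its merge has been finalized.

```python
class Stage:
    """Hypothetical stand-in for a stage; not Spark's Stage class."""

    def __init__(self, name, num_tasks, parents=()):
        self.name = name
        self.num_tasks = num_tasks
        self.parents = list(parents)
        self.map_outputs = set()      # partitions with registered map output
        self.merge_finalized = False  # stands in for _shuffleMergedFinalized

    def missing_partitions(self):
        return [p for p in range(self.num_tasks) if p not in self.map_outputs]

    def is_available(self):
        # Assumption: with push-based shuffle, availability also needs a finalized merge.
        return not self.missing_partitions() and self.merge_finalized


class ToyScheduler:
    """Hypothetical model of the ordering in the report; not DAGScheduler."""

    def __init__(self):
        self.running = set()
        self.waiting = set()
        self.launched = []  # names of stages for which tasks were launched

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage)

    def on_task_success(self, stage, partition):
        stage.map_outputs.add(partition)
        # Merge finalization is only triggered for stages still marked as running.
        if stage in self.running and not stage.missing_partitions():
            stage.merge_finalized = True
            self.running.discard(stage)
            self.submit_waiting_children(stage)

    def missing_parents(self, stage):
        return [p for p in stage.parents if not p.is_available()]

    def submit_stage(self, stage):
        missing = self.missing_parents(stage)
        if not missing:
            self.submit_missing_tasks(stage)
            return
        for parent in missing:
            self.submit_stage(parent)  # parent may finish right here ...
        self.waiting.add(stage)        # ... before the child starts waiting

    def submit_missing_tasks(self, stage):
        if stage.missing_partitions():
            self.running.add(stage)
            self.launched.append(stage.name)
        else:
            # No tasks to run: the stage finishes without finalizing the merge.
            self.submit_waiting_children(stage)

    def submit_waiting_children(self, parent):
        for child in [s for s in self.waiting if parent in s.parents]:
            self.waiting.discard(child)
            self.submit_stage(child)


def replay():
    sched = ToyScheduler()
    map_stage = Stage("shuffleMapStage", num_tasks=2)
    next_stage = Stage("nextStage", num_tasks=1, parents=[map_stage])

    sched.submit_stage(map_stage)         # first attempt starts running
    sched.on_task_success(map_stage, 0)   # one partition completes normally
    sched.on_fetch_failed(map_stage)      # stage leaves the running set
    sched.on_task_success(map_stage, 1)   # speculative task registers output
    sched.submit_stage(map_stage)         # resubmission finds no missing tasks
    sched.submit_stage(next_stage)        # the dependent stage is submitted
    return sched, map_stage, next_stage


if __name__ == "__main__":
    sched, map_stage, next_stage = replay()
    print("merge finalized:", map_stage.merge_finalized)
    print("next stage stuck in waiting:", next_stage in sched.waiting)
```

      In this model the map stage ends with every map output registered but the merge never finalized, and the dependent stage stays in the waiting set because the parent finished before the child was added.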

      I have only hit this a few times in my production environment, and it is difficult to reproduce.

      Attachments

        1. missParentStages.png
          220 kB
          Penglei Shi
        2. shuffleMergeFinalized.png
          129 kB
          Penglei Shi
        3. submitMissingTasks.png
          171 kB
          Penglei Shi

          People

            Assignee: Fencheng Mei (StoveM)
            Reporter: Penglei Shi
            Votes: 0
            Watchers: 5
