SPARK-50288: Executors of failed stage still alive even though the stage has been retried


Details

    • Type: Question
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0, 3.5.2
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      We are executing a Spark DataFrame job that uses foreachPartition, and we observed behavior that we are not able to explain.

      In the attached figure, you can see that stage 19 was retried as stage 19 (retry 1) due to a ShuffleOutputNotFound error.

      However, we found that some tasks from the original stage 19 attempt are still in state RUNNING, e.g. partitions 371 and 200. In addition, because they had not yet finished, partitions 371 and 200 were resubmitted again in stage 19 (retry 1), so the same partitions are processed twice.

      Is there any configuration that can ensure all tasks from the failed stage attempt are terminated before the retry stage is triggered?
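      Until the scheduler question is answered, a common workaround for this class of problem is to make the foreachPartition side effect idempotent, so a partition re-run by the retried stage attempt becomes a no-op. The sketch below is plain Python (no Spark required) and purely illustrative: the key layout, the `processed` store, and the `sink` are all hypothetical stand-ins for an external idempotency store and the downstream system, not anything from Spark's API.

      ```python
      # Hypothetical idempotency guard for a foreachPartition handler.
      # Because tasks from a failed stage attempt may still be RUNNING
      # while the retried attempt re-executes the same partitions, the
      # side effect can fire twice per partition; a deterministic key
      # makes the duplicate attempt harmless.

      processed = set()   # stand-in for an external idempotency store
      sink = []           # stand-in for the external system being written to

      def handle_partition(partition_id, rows):
          # Hypothetical key: (job id, stage id, partition id).
          key = ("job-42", 19, partition_id)
          if key in processed:
              return              # duplicate attempt: skip the side effect
          sink.extend(rows)       # the real external write would go here
          processed.add(key)

      # Simulate partition 371 being run by stage 19 and again by
      # stage 19 (retry 1): only the first run writes to the sink.
      handle_partition(371, ["a", "b"])
      handle_partition(371, ["a", "b"])
      ```

      In a real job the idempotency store would need to be shared across executors (e.g. a database with a unique-key constraint); a per-process set as shown only illustrates the guard's logic.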

      Attachments

        1. stage19_stage19retry1.png
          654 kB
          Yu-Ting LIN
        2. trigger_duplicate_process.png
          3.01 MB
          Yu-Ting LIN


          People

            Assignee: Unassigned
            Reporter: Yu-Ting LIN (yutinglin)
            Votes: 0
            Watchers: 2
