SPARK-50288: Executors of failed stage still alive even though the stage has been retried


Details

    • Type: Question
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0, 3.5.2
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      We are executing a Spark DataFrame job that uses foreachPartition, and we observed behavior that we are not able to explain.

      In the attached figure, you can see that stage 19 was retried as stage 19 (retry 1) due to a ShuffleOutputNotFound error.

      However, we found that some tasks from the original stage 19 attempt are still in state RUNNING, e.g. partitions 371 and 200. In addition, because they had not yet finished, partitions 371 and 200 were resubmitted again in stage 19 (retry 1), so the same partitions are processed twice.

      Is there any configuration that can ensure all tasks from the failed stage attempt are terminated before the retry stage is triggered?
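      Until the scheduler question is answered, a common workaround for this class of problem is to make the foreachPartition side effect idempotent, so a partition re-run by the retried stage attempt becomes a no-op. The sketch below is plain Python (no Spark required) and purely illustrative: the key layout, the `processed` store, and the `sink` are all hypothetical stand-ins for an external idempotency store and the downstream system, not anything from Spark's API.

      ```python
      # Hypothetical idempotency guard for a foreachPartition handler.
      # Because tasks from a failed stage attempt may still be RUNNING
      # while the retried attempt re-executes the same partitions, the
      # side effect can fire twice per partition; a deterministic key
      # makes the duplicate attempt harmless.

      processed = set()   # stand-in for an external idempotency store
      sink = []           # stand-in for the external system being written to

      def handle_partition(partition_id, rows):
          # Hypothetical key: (job id, stage id, partition id).
          key = ("job-42", 19, partition_id)
          if key in processed:
              return              # duplicate attempt: skip the side effect
          sink.extend(rows)       # the real external write would go here
          processed.add(key)

      # Simulate partition 371 being run by stage 19 and again by
      # stage 19 (retry 1): only the first run writes to the sink.
      handle_partition(371, ["a", "b"])
      handle_partition(371, ["a", "b"])
      ```

      In a real job the idempotency store would need to be shared across executors (e.g. a database with a unique-key constraint); a per-process set as shown only illustrates the guard's logic.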

      Attachments

        1. stage19_stage19retry1.png
          654 kB
          Yu-Ting LIN
        2. trigger_duplicate_process.png
          3.01 MB
          Yu-Ting LIN


          People

            Assignee: Unassigned
            Reporter: Yu-Ting LIN (yutinglin)
            Votes: 0
            Watchers: 2
