Details
- Type: Question
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.3.0, 3.5.2
- Fix Version/s: None
- Component/s: None
Description
We are running a Spark DataFrame job that uses foreachPartition, and we observed behavior we cannot explain.
As shown in the figure we provided, stage 19 failed with a ShuffleOutputNotFound error and was resubmitted as stage 19 (retry 1).
However, some tasks from the original stage 19 attempt were still in the RUNNING state, e.g. those for partitions 371 and 200. Because they had not yet finished, partitions 371 and 200 were also submitted again in stage 19 (retry 1), so the same partitions were being processed twice concurrently.
Is there any configuration that ensures all tasks from the failed stage attempt are terminated before the retried stage is launched?
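For reference, below is a sketch of the stage/shuffle retry-related configuration properties we are aware of from Spark's configuration reference (values shown are the documented defaults); as far as we can tell, none of them forces still-running tasks from a failed stage attempt to be killed before the retry starts:

```shell
# Maximum number of consecutive attempts of a stage before the job is aborted.
--conf spark.stage.maxConsecutiveAttempts=4
# Retries of a shuffle block fetch before it is reported as a fetch failure.
--conf spark.shuffle.io.maxRetries=3
# Number of individual task failures tolerated before giving up on the job.
--conf spark.task.maxFailures=4
```

These control how often retries happen, not whether tasks from the previous attempt are interrupted.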