Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
Currently, celeborn uses appShuffle's deterministic level to decide whether to reuse shuffle id. However, as in the example of CELEBORN-1496, for a non-INDETERMINATE barrier stage, it requires celeborn to create a new shuffle id during recomputation. Currently, celeborn reuses the shuffle id.
The key issue is that the deterministic level is not the sole criterion for reusing the shuffleId. In Spark, the DAGScheduler determines whether to reuse task output based on multiple factors, including the deterministic level. If it determines that the outputs cannot be reused, it proactively cleans up the map output. Therefore, celeborn should decide whether to reuse the shuffle id based on whether the map output is empty, effectively delegating the decision of reusing previous attempt results to the DAGScheduler.
In addition, getOutputDeterministicLevel is a protected method of the RDD, which introduces the risk of it being dynamically changed during execution. Currently, celeborn only checks this once when registering the shuffle, which may pose a risk. However, this possibility is very low, and I have not encountered a situation where getOutputDeterministicLevel changes during execution.
Attachments
Issue Links
- links to