Uploaded image for project: 'Apache Celeborn'
  1. Apache Celeborn
  2. CELEBORN-1498

Decide whether to reuse the shuffle id based on the appShuffle's numAvailableOutputs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None

    Description

      Currently, celeborn uses appShuffle's deterministic level to decide whether to reuse shuffle id. However, as in the example of CELEBORN-1496, for a non-INDETERMINATE barrier stage, it requires celeborn to create a new shuffle id during recomputation. Currently, celeborn reuses the shuffle id.

      The key issue is that the deterministic level is not the sole criterion for reusing the shuffleId. In Spark, the DAGScheduler determines whether to reuse task output based on multiple factors, including the deterministic level. If it determines that the outputs cannot be reused, it proactively cleans up the map output. Therefore, celeborn should decide whether to reuse the shuffle id based on whether the map output is empty, effectively delegating the decision of reusing previous attempt results to the DAGScheduler.

      In addition, getOutputDeterministicLevel is a protected method of the RDD, which introduces the risk of it being dynamically changed during execution. Currently, celeborn only checks this once when registering the shuffle, which may pose a risk. However, this possibility is very low, and I have not encountered a situation where getOutputDeterministicLevel changes during execution.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jiang13021 jiang13021
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 50m
                  1h 50m