  1. Spark
  2. SPARK-33235 Push-based Shuffle Improvement Tasks
  3. SPARK-48580

Add consistency check and fallback for mapIds in push-merged block meta

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 3.3.0, 3.4.0, 3.5.0
    • Fix Version/s: None
    • Component/s: Shuffle

    Description

      When push-based shuffle is enabled, 0.03% of the Spark applications in our cluster experienced shuffle data loss. The Exchange metrics were as follows:

      We eventually found WARN logs like the following on the shuffle server:

      WARN shuffle-server-8-216 org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application application_xxxx shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta failed

      We then traced the cause in the code:

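      The merge metadata that the reduce side obtains from the driver originates from the mapTracker held in the shuffle server's memory, while the chunk data is actually read according to the records in the shuffle server's metaFile. There is no consistency check between the two, so if an update to the index/meta files fails, as in the warning above, the mapIds reported to the reducer can diverge from the mapIds actually recorded in the metaFile.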
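
      The check this issue proposes could be sketched roughly as follows. This is a minimal Python illustration, not Spark code: `plan_merged_read`, `ReadPlan`, and both parameters are hypothetical names, and the fallback shown (fetching the original unmerged blocks for every mapId the driver reported) is an assumption about what a reasonable fallback would be.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ReadPlan:
    """Hypothetical result type: how a reducer should read one merged partition."""
    use_merged: bool                  # True: read the push-merged chunks
    fallback_map_ids: FrozenSet[int]  # mapIds to fetch as original blocks instead


def plan_merged_read(driver_map_ids: Iterable[int],
                     chunk_map_ids: Iterable[Iterable[int]]) -> ReadPlan:
    """Compare the mapIds the driver reported with those recorded per chunk.

    driver_map_ids: mapIds from the merge metadata (derived from mapTracker).
    chunk_map_ids:  one collection of mapIds per chunk, as read from the metaFile.
    """
    expected = frozenset(driver_map_ids)
    actual = frozenset(m for chunk in chunk_map_ids for m in chunk)

    if expected == actual:
        # Both views agree, so the merged chunks cover exactly what is expected.
        return ReadPlan(use_merged=True, fallback_map_ids=frozenset())

    # Assumed fallback: distrust the merged data entirely and fetch the
    # original blocks for every map output the driver expected.
    return ReadPlan(use_merged=False, fallback_map_ids=expected)
```

      For example, if the driver reports mapIds {0, 1, 2} but the metaFile chunks only record {0, 1}, the sketch returns a plan that skips the merged data and fetches all three original blocks, instead of silently dropping map output 2.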

          People

            Assignee: Unassigned
            Reporter: gaoyajun02