  1. Spark
  2. SPARK-42784

Fix the problem of incomplete creation of subdirectories in push merged localDir

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.2
    • Fix Version/s: 3.3.3, 3.4.2, 3.5.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      After we enabled push-based shuffle widely in our production environment, we found warning messages in the server-side logs.

      The warning log looks like this:

      ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) failed.
      java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize merged shuffle partition for appId application_1671244879475_44020960 shuffleId 3 shuffleMergeId 0 reduceId 935.

      After investigation, we identified the mechanism that triggers the bug.

      The driver requested two different containers on the same physical machine. While creating the 'push-merged' directory, the first container (container_1) created the mergeDir first and then began creating the subDirs, whose number is set by the "spark.diskStore.subDirectories" parameter. However, container_1's resources were preempted partway through, so only some of the subDirs were created. Because the mergeDir already existed, the second container (container_2) assumed that all directories had been created and did not create the missing subDirs.
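
      The failure mode described above can be illustrated with a small, self-contained sketch. This is not Spark's code: the function names, the directory name and the two-digit hexadecimal sub-directory naming are assumptions made for illustration only. It contrasts a "skip if the parent exists" check with an idempotent variant that verifies every sub-directory, which is one possible way to avoid the problem.

      ```python
      import os
      import tempfile


      def simulate_preempted_creation(merge_dir, created_before_preemption):
          """Hypothetical stand-in for container_1: create the parent directory,
          then only some of the sub-directories before being preempted."""
          os.mkdir(merge_dir)
          for i in range(created_before_preemption):
              os.mkdir(os.path.join(merge_dir, f"{i:02x}"))


      def create_subdirs_skip_if_parent_exists(merge_dir, num_subdirs):
          """Problematic pattern: an existing parent directory is treated as
          proof that every sub-directory already exists."""
          if os.path.exists(merge_dir):
              return
          os.mkdir(merge_dir)
          for i in range(num_subdirs):
              os.mkdir(os.path.join(merge_dir, f"{i:02x}"))


      def create_subdirs_idempotent(merge_dir, num_subdirs):
          """Possible remedy (an assumption, not necessarily the merged fix):
          ensure each sub-directory exists, whether or not the parent does."""
          for i in range(num_subdirs):
              os.makedirs(os.path.join(merge_dir, f"{i:02x}"), exist_ok=True)


      if __name__ == "__main__":
          root = tempfile.mkdtemp()
          merge_dir = os.path.join(root, "push-merged")  # illustrative name
          simulate_preempted_creation(merge_dir, 3)

          # container_2 with the problematic check: nothing new is created.
          create_subdirs_skip_if_parent_exists(merge_dir, 64)
          print(len(os.listdir(merge_dir)))

          # container_2 with the idempotent check: the gaps are filled in.
          create_subdirs_idempotent(merge_dir, 64)
          print(len(os.listdir(merge_dir)))
      ```

      In this sketch the value 64 simply stands in for whatever "spark.diskStore.subDirectories" is set to. The idempotent variant costs one existence check per sub-directory on every start-up, but it tolerates a previous container having stopped at any point.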

       

          People

            Assignee: Fencheng Mei (StoveM)
            Reporter: Fencheng Mei (StoveM)