Details
Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.2
None
None
Description
After enabling push-based shuffle at scale in our production environment, we found warning messages in the server-side logs.
The warnings look like this:
ShuffleBlockPusher: Pushing block shufflePush_3_0_5352_935 to BlockManagerId(shuffle-push-merger, zw06-data-hdp-dn08251.mt, 7337, None) failed.
java.lang.RuntimeException: java.lang.RuntimeException: Cannot initialize merged shuffle partition for appId application_1671244879475_44020960 shuffleId 3 shuffleMergeId 0 reduceId 935.
After investigation, we identified what triggers the bug.
The driver requested two different containers on the same physical machine. When the first container (container_1) set up the 'push-merged' directory, it created mergeDir first and then created the sub-directories (subDir) under it, with their number taken from the "spark.diskStore.subDirectories" parameter. However, container_1 was preempted while the sub-directories were being created, so only some of them existed. Because mergeDir was already present, the second container (container_2) assumed that all directories had been created and did not create the missing ones. Merged shuffle partitions that map to a missing sub-directory therefore cannot be initialized, which matches the error above.
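
The sequence above can be reproduced with a small simulation. This is only a sketch of the pattern described in this report, not Spark's actual code: the function names, the two-digit hex sub-directory names, and the `stop_after` parameter (which stands in for container preemption) are all assumptions made for illustration. The second function shows one possible remedy, creating every sub-directory idempotently regardless of whether mergeDir exists; it is not necessarily the fix that will be adopted.

```python
import os
import tempfile


def create_merge_dirs_buggy(local_dir, num_sub_dirs, stop_after=None):
    """Hypothetical model of the failing pattern: sub-directories are
    created only when mergeDir does not exist yet."""
    merge_dir = os.path.join(local_dir, "push-merged")
    if os.path.exists(merge_dir):
        # A later container assumes an earlier one finished the whole job.
        return
    os.mkdir(merge_dir)
    for i in range(num_sub_dirs):
        if stop_after is not None and i >= stop_after:
            # Simulates the container being preempted mid-loop.
            return
        os.mkdir(os.path.join(merge_dir, f"{i:02x}"))


def create_merge_dirs_idempotent(local_dir, num_sub_dirs):
    """Possible remedy (an assumption): always ensure every sub-directory
    exists, so a partially created mergeDir gets repaired."""
    merge_dir = os.path.join(local_dir, "push-merged")
    for i in range(num_sub_dirs):
        os.makedirs(os.path.join(merge_dir, f"{i:02x}"), exist_ok=True)


def missing_sub_dirs(local_dir, num_sub_dirs):
    """Return the names of expected sub-directories that are absent."""
    merge_dir = os.path.join(local_dir, "push-merged")
    return [
        f"{i:02x}"
        for i in range(num_sub_dirs)
        if not os.path.isdir(os.path.join(merge_dir, f"{i:02x}"))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local_dir:
        # container_1 is preempted after creating 2 of 4 sub-directories.
        create_merge_dirs_buggy(local_dir, 4, stop_after=2)
        # container_2 sees mergeDir and skips creation entirely.
        create_merge_dirs_buggy(local_dir, 4)
        print("missing after container_2:", missing_sub_dirs(local_dir, 4))
        create_merge_dirs_idempotent(local_dir, 4)
        print("missing after repair:", missing_sub_dirs(local_dir, 4))
```

In this model the second call to `create_merge_dirs_buggy` changes nothing, mirroring container_2's behavior, while the idempotent variant fills in whatever the preempted container left behind.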