Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0, 3.3.0, 3.4.0, 3.5.0
Fix Version/s: None
Description
When push-based shuffle is enabled, 0.03% of the Spark applications in our cluster experienced shuffle data loss. The Exchange metrics were as follows: [screenshot not preserved]
We eventually found some WARN logs on the shuffle server:
WARN shuffle-server-8-216 org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application application_xxxx shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta failed
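We then traced the cause in the code:
The merge metadata that the reduce side obtains from the driver originates from the mapTracker held in the shuffle server's memory, whereas the chunk data is actually read according to the records in the shuffle server's metaFile. There is no consistency check between the two, so when an update to the index/meta file fails, the in-memory view can report map outputs as merged that the metaFile never recorded.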
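To make the missing check concrete, here is a minimal sketch of the kind of comparison that could detect the mismatch. It is not Spark code: plain Python sets stand in for the bitmaps, and the function name `maps_missing_from_meta` and both parameter names are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Set


def maps_missing_from_meta(reported_maps: Set[int],
                           chunk_maps: Iterable[Set[int]]) -> Set[int]:
    """Return map indices reported as merged but absent from every chunk record.

    reported_maps: map indices taken from the in-memory tracker (assumed model).
    chunk_maps:    one set of map indices per chunk recorded in the meta file
                   (assumed model).
    """
    recorded: Set[int] = set()
    for chunk in chunk_maps:
        recorded |= chunk  # union of everything the meta file actually covers
    return reported_maps - recorded


# Hypothetical scenario: the in-memory tracker says maps 0..3 were merged,
# but the index/meta update for the chunk holding map 3 failed.
reported = {0, 1, 2, 3}
chunks_in_meta = [{0, 1}, {2}]
missing = maps_missing_from_meta(reported, chunks_in_meta)
print(sorted(missing))
```

A non-empty result would mean the reducer skips fetching the original blocks for those maps, because they are reported as merged, yet never reads them from the merged chunks either, which matches the data loss described above.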