[SPARK-48580] Add consistency check and fallback for mapIds in push-merged block meta - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0, 3.3.0, 3.4.0, 3.5.0
Fix Version/s: None
Component/s: Shuffle
Labels:
- pull-request-available

Description

With push-based shuffle enabled, 0.03% of the Spark applications in our cluster experienced shuffle data loss. The Exchange metrics are as follows (see the attached image):

We eventually found some WARN logs on the shuffle server:

WARN shuffle-server-8-216 org.apache.spark.network.shuffle.RemoteBlockPushResolver: Application application_xxxx shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta failed

Analyzing the code, we found the cause:

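The merge metadata that the reduce side obtains from the driver comes from the mapTracker held in the server's memory, whereas the chunk data is actually read according to the records in the shuffle server's metaFile. There is no consistency check between the two, so when an update to the index/meta file fails (as in the WARN log above), the two views of which map outputs were merged can diverge.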
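
The kind of check this issue asks for can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the linked pull request: the class `MergedMetaCheck` and all of its methods are hypothetical, and `java.util.BitSet` stands in for whatever bitmap type the real implementation uses. It assumes the reducer has one bitmap of map indices from the tracker and one bitmap per chunk from the meta file.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical helper for illustration only; none of these names exist in Spark.
final class MergedMetaCheck {
    private MergedMetaCheck() {}

    // Union of the map indices recorded for each chunk in the meta file.
    public static BitSet unionOfChunks(List<BitSet> chunkMapIds) {
        BitSet union = new BitSet();
        for (BitSet chunk : chunkMapIds) {
            union.or(chunk);
        }
        return union;
    }

    // True only when the tracker's view and the meta file cover exactly
    // the same set of map indices.
    public static boolean isConsistent(BitSet trackerMapIds, List<BitSet> chunkMapIds) {
        return trackerMapIds.equals(unionOfChunks(chunkMapIds));
    }

    // Map indices the tracker reports as merged but the meta file does not record.
    public static BitSet missingFromMeta(BitSet trackerMapIds, List<BitSet> chunkMapIds) {
        BitSet missing = (BitSet) trackerMapIds.clone();
        missing.andNot(unionOfChunks(chunkMapIds));
        return missing;
    }

    // Assumed fallback policy: trust the merged block only when both views agree;
    // otherwise the caller should fetch the original, unmerged blocks instead.
    public static boolean shouldFallBackToOriginalBlocks(
            BitSet trackerMapIds, List<BitSet> chunkMapIds) {
        return !isConsistent(trackerMapIds, chunkMapIds);
    }
}
```

For example, if the tracker reports map indices 0, 1 and 2 as merged but the meta file only records chunks covering 0 and 1, `missingFromMeta` returns the set containing 2 and `shouldFallBackToOriginalBlocks` returns true, so the reducer would avoid silently dropping the output of map 2.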

Attachments

image-2024-06-11-10-19-57-227.png
11/Jun/24 02:19
151 kB
gaoyajun02

Issue Links

links to

GitHub Pull Request #46934

https://github.com/apache/spark/pull/46934

People

Assignee: Unassigned

Reporter: gaoyajun02

Votes: 0

Watchers: 1

Dates

Created: 11/Jun/24 02:15

Updated: 29/Sep/24 00:26