Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.1, 3.5.0
Description
We recently identified a very tricky race condition in decommissioned nodes, which could lead to shuffle data loss even when data migration is enabled:
- At 04:30:51, the RDD block refresh ran and found no pending work
- Shortly after that (within a few milliseconds), the shutdownThread in CoarseGrainedExecutorBackend found 1 running task, so lastTaskRunningTime was updated to the current system nano time
- Shortly after that, the shuffle block refresh ran and found no pending work
- Shortly after that, a task finished on the decommissioned executor and generated new shuffle blocks
- One second later, the shutdownThread in CoarseGrainedExecutorBackend found no running tasks; lastTaskRunningTime was not updated, and the executor did not exit because min(lastRDDMigrationTime, lastShuffleMigrationTime) < lastTaskRunningTime
- After 30 seconds, at 04:31:21, the RDD block refresh ran again and found no pending work, so lastRDDMigrationTime was updated to the current system nano time
- At that exact moment, all known blocks had been migrated and min(lastRDDMigrationTime, lastShuffleMigrationTime) > lastTaskRunningTime (the decision rule is sketched after this list)
- The shutdownThread was triggered and asked to stop the executor
- The shuffle block refresh thread was still sleeping and got interrupted by the stop command, so it never had a chance to discover the shuffle blocks generated by the previously finished task
- Eventually the executor exited, the output of the task was lost, and Spark needed to recompute that partition
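To make the timing above easier to follow, here is a minimal sketch of the shutdown decision rule described in the timeline. It is illustrative only: the names (numRunningTasks, allBlocksMigrated, exitExecutor) are simplified stand-ins and are not claimed to match the actual CoarseGrainedExecutorBackend code.

```scala
// Illustrative sketch of the decommission shutdown check described above.
// lastRDDMigrationTime / lastShuffleMigrationTime are written by the two
// refresh threads; lastTaskRunningTime is the watermark the shutdownThread
// keeps for the last time it saw a task running.
object DecommissionShutdownCheck {

  @volatile var lastRDDMigrationTime: Long = 0L      // set by the RDD block refresh
  @volatile var lastShuffleMigrationTime: Long = 0L  // set by the shuffle block refresh
  @volatile var allBlocksMigrated: Boolean = false   // true when both refreshes saw no pending work

  def numRunningTasks: Int = 0                       // would query the executor in real code
  def exitExecutor(): Unit = println("stopping executor")

  def shutdownLoop(): Unit = {
    var lastTaskRunningTime = System.nanoTime()
    var stopped = false
    while (!stopped) {
      if (numRunningTasks == 0) {
        val migrationTime = math.min(lastRDDMigrationTime, lastShuffleMigrationTime)
        // Exit only once both refresh threads have reported "no pending work"
        // *after* the last time a task was seen running.
        if (allBlocksMigrated && migrationTime > lastTaskRunningTime) {
          exitExecutor()
          stopped = true
        }
      } else {
        // A running task may still produce new blocks, so push the watermark forward.
        lastTaskRunningTime = System.nanoTime()
      }
      if (!stopped) Thread.sleep(1000)  // re-check roughly once per second, as in the timeline
    }
  }
}
```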
The root cause of the race condition is that the shuffle block refresh ran in the window between lastTaskRunningTime being updated and the task finishing; in that case, the shutdownThread could request to stop the executor before the BlockManagerDecommissioner had discovered the new shuffle blocks generated by the last finished task.
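For illustration, the sketch below shows the shape of a periodic shuffle block refresh loop that can exhibit this window. It is not the actual BlockManagerDecommissioner implementation; getStoredShuffles and enqueueForMigration are hypothetical placeholders. The point is that shuffle files written while the thread sleeps are only noticed on the next wake-up, and an interrupt delivered during that sleep (the stop command) means they are never noticed at all.

```scala
import scala.collection.mutable

// Illustrative sketch of a periodic shuffle block refresh loop, not the real code.
class ShuffleRefreshSketch(refreshIntervalMs: Long) {
  private val knownShuffles = mutable.Set.empty[String]
  @volatile var lastShuffleMigrationTime: Long = 0L

  def getStoredShuffles(): Set[String] = Set.empty      // hypothetical placeholder
  def enqueueForMigration(ids: Set[String]): Unit = ()  // hypothetical placeholder

  def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        val refreshStart = System.nanoTime()
        val newShuffles = getStoredShuffles() -- knownShuffles
        if (newShuffles.isEmpty) {
          // "No pending work": this refresh time feeds the min(...) comparison
          // the shutdown thread uses to decide whether it is safe to exit.
          lastShuffleMigrationTime = refreshStart
        } else {
          knownShuffles ++= newShuffles
          enqueueForMigration(newShuffles)
        }
        Thread.sleep(refreshIntervalMs)  // window: blocks written now stay unseen until next wake-up
      }
    } catch {
      case _: InterruptedException => // stop command arrived mid-sleep; the pending refresh never runs
    }
  }
}
```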