Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.1, 3.5.0
Description
We recently identified a very tricky race condition in decommissioned nodes, which could lead to shuffle data loss even when data migration is enabled:
- At 04:30:51, the RDD block refresh ran and found no pending work
- Shortly after that (within a few milliseconds), the shutdownThread in CoarseGrainedExecutorBackend found 1 running task, so lastTaskRunningTime was updated to the current system nano time
- Shortly after that, the shuffle block refresh ran and found no pending work
- Shortly after that, a task finished on the decommissioned executor and generated new shuffle blocks
- One second later, the shutdownThread in CoarseGrainedExecutorBackend found no running tasks; lastTaskRunningTime was not updated, and the executor did not exit because min(lastRDDMigrationTime, lastShuffleMigrationTime) < lastTaskRunningTime
- After 30 seconds, at 04:31:21, the RDD block refresh ran again and found no pending work, so lastRDDMigrationTime was updated to the current system nano time
- At that exact moment, all known blocks had been migrated and min(lastRDDMigrationTime, lastShuffleMigrationTime) > lastTaskRunningTime (the decision rule is sketched after this list)
- The shutdownThread was triggered and asked to stop the executor
- The shuffle block refresh thread was still sleeping and got interrupted by the stop command, so it never had a chance to discover the shuffle blocks generated by the previously finished task
- Eventually the executor exited, the output of the task was lost, and Spark needed to recompute that partition
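To make the timing above easier to follow, here is a minimal sketch of the shutdown decision rule described in the timeline. It is illustrative only: the names (numRunningTasks, allBlocksMigrated, exitExecutor) are simplified stand-ins and are not claimed to match the actual CoarseGrainedExecutorBackend code.

```scala
// Illustrative sketch of the decommission shutdown check described above.
// lastRDDMigrationTime / lastShuffleMigrationTime are written by the two
// refresh threads; lastTaskRunningTime is the watermark the shutdownThread
// keeps for the last time it saw a task running.
object DecommissionShutdownCheck {

  @volatile var lastRDDMigrationTime: Long = 0L      // set by the RDD block refresh
  @volatile var lastShuffleMigrationTime: Long = 0L  // set by the shuffle block refresh
  @volatile var allBlocksMigrated: Boolean = false   // true when both refreshes saw no pending work

  def numRunningTasks: Int = 0                       // would query the executor in real code
  def exitExecutor(): Unit = println("stopping executor")

  def shutdownLoop(): Unit = {
    var lastTaskRunningTime = System.nanoTime()
    var stopped = false
    while (!stopped) {
      if (numRunningTasks == 0) {
        val migrationTime = math.min(lastRDDMigrationTime, lastShuffleMigrationTime)
        // Exit only once both refresh threads have reported "no pending work"
        // *after* the last time a task was seen running.
        if (allBlocksMigrated && migrationTime > lastTaskRunningTime) {
          exitExecutor()
          stopped = true
        }
      } else {
        // A running task may still produce new blocks, so push the watermark forward.
        lastTaskRunningTime = System.nanoTime()
      }
      if (!stopped) Thread.sleep(1000)  // re-check roughly once per second, as in the timeline
    }
  }
}
```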
The root cause of the race condition is that the shuffle block refresh ran in the window between lastTaskRunningTime being updated and the task finishing; in that case, the shutdownThread could request to stop the executor before the BlockManagerDecommissioner had discovered the new shuffle blocks generated by the last finished task.
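For illustration, the sketch below shows the shape of a periodic shuffle block refresh loop that can exhibit this window. It is not the actual BlockManagerDecommissioner implementation; getStoredShuffles and enqueueForMigration are hypothetical placeholders. The point is that shuffle files written while the thread sleeps are only noticed on the next wake-up, and an interrupt delivered during that sleep (the stop command) means they are never noticed at all.

```scala
import scala.collection.mutable

// Illustrative sketch of a periodic shuffle block refresh loop, not the real code.
class ShuffleRefreshSketch(refreshIntervalMs: Long) {
  private val knownShuffles = mutable.Set.empty[String]
  @volatile var lastShuffleMigrationTime: Long = 0L

  def getStoredShuffles(): Set[String] = Set.empty      // hypothetical placeholder
  def enqueueForMigration(ids: Set[String]): Unit = ()  // hypothetical placeholder

  def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        val refreshStart = System.nanoTime()
        val newShuffles = getStoredShuffles() -- knownShuffles
        if (newShuffles.isEmpty) {
          // "No pending work": this refresh time feeds the min(...) comparison
          // the shutdown thread uses to decide whether it is safe to exit.
          lastShuffleMigrationTime = refreshStart
        } else {
          knownShuffles ++= newShuffles
          enqueueForMigration(newShuffles)
        }
        Thread.sleep(refreshIntervalMs)  // window: blocks written now stay unseen until next wake-up
      }
    } catch {
      case _: InterruptedException => // stop command arrived mid-sleep; the pending refresh never runs
    }
  }
}
```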