Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46182

Shuffle data lost on decommissioned executor caused by race condition between lastTaskRunningTime and lastShuffleMigrationTime

    XMLWordPrintableJSON

Details

    Description

      We recently identified a very tricky race condition in decommissioned node, which could lead to shuffle data lost even data migration is enabled:

      • At 04:30:51, RDD block refresh happened, and found no pending works
      • Shortly after that (a few milliseconds), the shutdownThread in CoarseGrainedExecutorBackend found 1 running task, so lastTaskRunningTime updated to the current system nano time
      • Shortly after that, Shuffle block refresh happened, and found no pending works
      • Shortly after that, a task finished on the decommissioned executor, and generated new shuffle blocks
      • One second later, the shutdownThread in CoarseGrainedExecutorBackend found no running task, lastTaskRunningTime would not be updated, and the executor didn’t exit because min(lastRDDMigrationTime, lastShuffleMigrationTime) < lastTaskRunningTime
      • After 30 seconds, at 04:31:21, RDD block refresh happened, and found no pending works, lastRDDMigrationTime updated to the current system nano time
      • At this exact moment, all known blocks are migrated, and min(lastRDDMigrationTime, lastShuffleMigrationTime) > lastTaskRunningTime
      • shutdownThread is triggered, and asked to stop the executor
      • Shuffle block refresh thread was still sleeping, and got interrupted by the stop command, so it didn’t have the chance to discover the shuffle blocks generated by the previously finished task
      • Eventually, the executor exited, and the output of the task was lost, Spark need to recompute that partition

      The root cause for the race condition is that the Shuffle block refresh happened between lastTaskRunningTime was updated and task finished, in that case the shutdownThread could request to stop the executor before the BlockManagerDecommissioner discover the new shuffle blocks generated by the latest finished task.

      Attachments

        Issue Links

          Activity

            People

              jiangxb1987 Xingbo Jiang
              jiangxb1987 Xingbo Jiang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: