SPARK-42737

Shuffle files lost with graceful decommission fallback storage enabled

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.2
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      While testing graceful decommissioning, we saw the driver log report lost shuffle files (`DAGScheduler: Shuffle files lost for executor`) for every decommissioned executor:

      23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
      23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
      23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 decommissioned message
      23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 1
      23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
      23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 decommissioned message
      23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2
      23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
      23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.
      23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
      23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
      23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.
      23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
      23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.
      23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
      23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
      23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.5.11, 44707, None)
      23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
      23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)
      23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
      23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
      23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 100.96.5.9, 44491, None)
      23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
      23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
      23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
      23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
      23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 100.96.5.10, 39011, None)
      23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
      23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)
      23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
      23/03/09 15:22:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
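
To check for this pattern in a longer driver log, the lines can be scanned mechanically. The following is a minimal sketch that assumes the log wording quoted in this report; the function name `lost_after_graceful_decommission` and both regular expressions are illustrative and not part of Spark.

```python
import re

# Patterns written against the driver log lines quoted in this report.
DECOMMISSION_RE = re.compile(r"Lost executor (\d+) on [^:]+: Executor decommission\.")
SHUFFLE_LOST_RE = re.compile(r"DAGScheduler: Shuffle files lost for executor: (\d+)")


def lost_after_graceful_decommission(lines):
    """Return IDs of executors logged as decommissioned and later as having lost shuffle files."""
    decommissioned = set()
    flagged = []
    for line in lines:
        match = DECOMMISSION_RE.search(line)
        if match:
            decommissioned.add(match.group(1))
            continue
        match = SHUFFLE_LOST_RE.search(line)
        if match and match.group(1) in decommissioned:
            flagged.append(match.group(1))
    return flagged


# Sample lines copied from the driver log above.
SAMPLE_DRIVER_LOG = [
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.",
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.",
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.",
    "23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)",
    "23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)",
    "23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)",
]

print(lost_after_graceful_decommission(SAMPLE_DRIVER_LOG))  # prints ['3', '1', '2']
```

All three executors are flagged, which matches what we observed: each one was removed as gracefully decommissioned and then reported as having lost shuffle files.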
      

      The decommission logs from the executor also seem to indicate that there was no shuffle data to migrate:

      23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
      23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
      23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning process...
      23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
      23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
      23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet migrated.
      23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
      23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
      23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
      23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
      23/03/09 15:22:44 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
      23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
      23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
      23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
      23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopped block migration
      23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop shuffle block migration().
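
The `0 of 0 local shuffles are added` line is the one that suggests nothing needed migrating. A small sketch for pulling those counters out of an executor log, assuming the message format shown above; the function name and the tuple field names (`added`, `local`, `remaining`) are my own reading of the message, not Spark identifiers.

```python
import re

# Matches the BlockManagerDecommissioner refresh summary quoted in this report.
SHUFFLE_REFRESH_RE = re.compile(
    r"BlockManagerDecommissioner: (\d+) of (\d+) local shuffles are added\. "
    r"In total, (\d+) shuffles are remained\."
)


def shuffle_refresh_counts(lines):
    """Return one (added, local, remaining) tuple per refresh round found in the log."""
    counts = []
    for line in lines:
        match = SHUFFLE_REFRESH_RE.search(line)
        if match:
            counts.append(tuple(int(group) for group in match.groups()))
    return counts


# Sample line copied from the executor log above.
SAMPLE_EXECUTOR_LOG = [
    "23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. "
    "In total, 0 shuffles are remained.",
]

print(shuffle_refresh_counts(SAMPLE_EXECUTOR_LOG))  # prints [(0, 0, 0)]
```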
      

      The `Shuffle files lost` message therefore seems incorrect, since there were no shuffle files to migrate in the first place. We enabled:

      • spark.decommission.enabled
      • spark.storage.decommission.rddBlocks.enabled
      • spark.storage.decommission.shuffleBlocks.enabled
      • spark.storage.decommission.enabled

      We also set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
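
For anyone trying to reproduce this, the settings above could be written out roughly as follows. This is only a sketch: the property keys are the ones listed in this report, the values assume "enabled" means `true`, and the bucket URI `s3a://example-bucket/spark-fallback/` is a hypothetical placeholder because the actual path is not given here. The helper `to_submit_args` is likewise illustrative.

```python
# Settings named in this report, with "enabled" assumed to mean "true".
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Hypothetical placeholder: the report does not state the real bucket path.
    "spark.storage.decommission.fallbackStorage.path": "s3a://example-bucket/spark-fallback/",
}


def to_submit_args(conf):
    """Flatten a config dict into `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(DECOMMISSION_CONF)))
```

The resulting argument list can be appended to a `spark-submit` command line; the same key/value pairs could equally go into `spark-defaults.conf`.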

      The same message also appeared in runs where shuffle files did exist and had been stored in the bucket.

          People

            Unassigned
            Yeachan Park (yeachan153)
            Holden Karau