[SPARK-42737] Shuffle files lost with graceful decommission fallback storage enabled - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.3.2
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

While testing graceful decommissioning, we saw the driver logs report that shuffle files were lost (`DAGScheduler: Shuffle files lost for executor`):

23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
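
To spot this pattern in a longer driver log, a small script can list the executors that were removed with the reason "Executor decommission" and then reported by `DAGScheduler` as having lost shuffle files. This is only a sketch: the function name and the regular expressions are hypothetical, derived from the log lines above, and are not part of Spark.

```python
import re

# Patterns derived from the driver log lines in this report (assumed stable wording).
DECOMMISSIONED = re.compile(
    r"TaskSchedulerImpl: Lost executor (\d+) on [^:]+: Executor decommission\."
)
SHUFFLE_LOST = re.compile(r"DAGScheduler: Shuffle files lost for executor: (\d+)")


def lost_shuffle_after_decommission(lines):
    """Return executor IDs reported as 'Shuffle files lost' after a decommission removal."""
    decommissioned = set()
    flagged = []
    for line in lines:
        match = DECOMMISSIONED.search(line)
        if match:
            decommissioned.add(match.group(1))
            continue
        match = SHUFFLE_LOST.search(line)
        if match and match.group(1) in decommissioned and match.group(1) not in flagged:
            flagged.append(match.group(1))
    return flagged


if __name__ == "__main__":
    sample = [
        "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.",
        "23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)",
    ]
    print(lost_shuffle_after_decommission(sample))
```

Executor IDs are kept as strings and returned in the order the "Shuffle files lost" lines appear.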

The decommission logs from the executor also seem to indicate that there was no shuffle data to migrate:

23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
23/03/09 15:22:44 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopped block migration
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop shuffle block migration().
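
The "0 of 0 local shuffles are added" line is the executor-side evidence that nothing needed migrating. As a rough sketch, the counts can be pulled out of that message for comparison across executors. The helper name and the field names (`added`, `local`, `remaining`) are hypothetical labels taken from the wording of the message, not from Spark's API.

```python
import re

# Pattern derived from the BlockManagerDecommissioner line in this report.
MIGRATION_SUMMARY = re.compile(
    r"(\d+) of (\d+) local shuffles are added\. In total, (\d+) shuffles are remained\."
)


def parse_migration_summary(line):
    """Return (added, local, remaining) as ints, or None if the line does not match."""
    match = MIGRATION_SUMMARY.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


if __name__ == "__main__":
    line = (
        "23/03/09 15:22:44 INFO BlockManagerDecommissioner: "
        "0 of 0 local shuffles are added. In total, 0 shuffles are remained."
    )
    print(parse_migration_summary(line))
```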

The `Shuffle files lost for executor` message seems incorrect, since there were no shuffle files to migrate in the first place. We enabled:

spark.decommission.enabled
spark.storage.decommission.rddBlocks.enabled
spark.storage.decommission.shuffleBlocks.enabled
spark.storage.decommission.enabled
and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
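
For anyone trying to reproduce this, the settings above can be rendered as `spark-submit` arguments roughly as follows. This is a sketch under assumptions: the report only says the flags were enabled, so the `true` values are assumed, and the bucket path is a hypothetical placeholder rather than the path actually used.

```python
# Settings named in the report; the "true" values are assumed.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Hypothetical placeholder; the report does not give the actual bucket path.
    "spark.storage.decommission.fallbackStorage.path": "s3a://example-bucket/spark-fallback/",
}


def to_spark_submit_args(conf):
    """Render a config mapping as a flat list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(DECOMMISSION_CONF)))
```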

The same message also appeared when there actually were shuffle files and they had been stored in the bucket.
