[SPARK-38969] Graceful decommissioning on Kubernetes fails / decom script error


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500).

       

    Description

      Hello, we are running into an issue while attempting graceful decommissioning of executors. We enabled:

      • spark.decommission.enabled 
      • spark.storage.decommission.rddBlocks.enabled
      • spark.storage.decommission.shuffleBlocks.enabled
      • spark.storage.decommission.enabled

      and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
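       
      For reference, here is how these settings would look written out in spark-defaults.conf (a minimal sketch; the gs:// path is only a placeholder for a directory in our bucket):
       
      ```
      spark.decommission.enabled                         true
      spark.storage.decommission.enabled                 true
      spark.storage.decommission.rddBlocks.enabled       true
      spark.storage.decommission.shuffleBlocks.enabled   true
      # Placeholder path: in our deployment this points at a directory in our GCS bucket
      spark.storage.decommission.fallbackStorage.path    gs://<our-bucket>/spark-decom-fallback/
      ```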
       
      The logs from the driver seem to suggest that the decommissioning process started, but that the executor then unexpectedly exited and decommissioning failed:
       
      ```
      22/04/20 15:09:09 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 3 decommissioned message
      22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
      22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
      22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: Executor decommission.
      22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
      22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 3).
      22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
      22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.1.130, 44789, None)
      22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
      22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 2)
      ```
       
      However, the executor logs seem to suggest that decommissioning was successful:
       
      ```
      22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
      22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
      22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning process...
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
      22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
      22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
      22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
      22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
      22/04/20 15:09:10 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
      22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopped block migration
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
      22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
      22/04/20 15:09:10 INFO MemoryStore: MemoryStore cleared
      22/04/20 15:09:10 INFO BlockManager: BlockManager stopped
      22/04/20 15:09:10 INFO ShutdownHookManager: Shutdown hook called
      ```
       
      The decommissioning script `/opt/decom.sh` also always terminates with exit code 137; we're not really sure why that is.
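       
      For what it's worth, exit code 137 corresponds to 128 + 9 (SIGKILL), so something appears to be force-killing the container. A quick way to check what Kubernetes recorded for the terminated executor container (pod name and namespace below are placeholders):
       
      ```
      # Termination details for the decommissioned executor pod
      kubectl describe pod <executor-pod-name> -n <namespace>
      
      # Just the recorded reason and exit code of the last terminated container
      kubectl get pod <executor-pod-name> -n <namespace> \
        -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}'
      ```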
       
       
       

          People

             Assignee: Unassigned
             Reporter: Yeachan Park (yeachan153)