Spark / SPARK-37355

Avoid Block Manager registrations when Executor is shutting down

Details

    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Note: Similar to SPARK-34949 and SPARK-35011, the BlockManager re-registers itself while the executor is shutting down.

      Problem:

      As described in SPARK-35011, HeartbeatReceiver.expireDeadHosts() does not clean up the BlockManager of an executor that is killed with the reason "Executor heartbeat timed out". An executor's heartbeat can time out because of a network issue, or for some other reason such as SPARK-20977.
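
The ordering problem can be illustrated with a toy model. This is a minimal sketch, not Spark code: `ToyMaster`, `ToyExecutor`, `shutting_down`, and `guard_on_shutdown` are hypothetical names introduced only for illustration, and the guard shown is one possible way to skip registration during shutdown, not necessarily the fix adopted for this issue.

```python
class ToyMaster:
    """Hypothetical stand-in for the driver-side registry of block managers."""

    def __init__(self):
        self.registered = set()

    def register(self, executor_id):
        self.registered.add(executor_id)

    def remove(self, executor_id):
        self.registered.discard(executor_id)


class ToyExecutor:
    """Hypothetical stand-in for an executor that may re-register itself."""

    def __init__(self, executor_id, master, guard_on_shutdown):
        self.executor_id = executor_id
        self.master = master
        self.guard_on_shutdown = guard_on_shutdown
        self.shutting_down = False

    def shutdown(self):
        self.shutting_down = True

    def reregister(self):
        # Assumed guard: skip registration once shutdown has begun.
        if self.guard_on_shutdown and self.shutting_down:
            return False
        self.master.register(self.executor_id)
        return True


def replay(guard_on_shutdown):
    """Replay the event order reported in this issue; return the master's view."""
    master = ToyMaster()
    executor = ToyExecutor("3056", master, guard_on_shutdown)
    master.register("3056")  # initial registration
    executor.shutdown()      # executor: "Driver commanded a shutdown"
    master.remove("3056")    # driver: "Removing block manager ..."
    executor.reregister()    # executor: late "Registering BlockManager ..."
    return master.registered


print(replay(guard_on_shutdown=False))  # executor 3056 is still registered
print(replay(guard_on_shutdown=True))   # no stale registration remains
```

Without the guard, the late registration arrives after the removal and leaves a stale entry in the registry. With the guard, the executor declines to register once it knows it is shutting down.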

      Logs:

      // Driver Logs
      
      21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36] spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent heartbeats: 350149 ms exceeds timeout 300000 ms
      21/11/13 05:06:20,999 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
      21/11/13 05:06:21,000 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be killed is 3056
      21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8] yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s) 3056.
      21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler] cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor 3056 with reason Executor heartbeat timed out after 350149 ms
      21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler] cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor heartbeat timed out after 350149 ms
      21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID 245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
      21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID 245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
      21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from BlockManagerMaster.
      21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Removing block manager BlockManagerId(3056, executor_host, 30504, None)
      21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
      21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Registering block manager executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 30504, None)
      
      
      // Executor Logs
      
      21/11/13 05:06:21,004 INFO [dispatcher-Executor] executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
      21/11/13 05:06:22,215 INFO [block-manager-future-0] storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056, executor_host, 30504, None)
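
To check other driver logs for the same pattern, a small script can flag any "Registering block manager" line that follows a "Removing block manager" line for the same executor. This is a rough sketch that assumes the log layout shown above; the function name `find_late_registrations` and the regular expression are illustrative and may need adjusting for other log formats.

```python
import re
from datetime import datetime

# Three driver log lines copied from the excerpt above.
DRIVER_LOG = """\
21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Removing block manager BlockManagerId(3056, executor_host, 30504, None)
21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Registering block manager executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 30504, None)
"""

# Captures the timestamp, the action, and the executor id inside BlockManagerId(...).
LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"(?P<action>Removing|Registering) block manager .*?"
    r"BlockManagerId\((?P<executor>[^,]+),"
)


def find_late_registrations(log_text):
    """Return (executor_id, delay_seconds) for registrations seen after a removal."""
    removed_at = {}
    late = []
    for line in log_text.splitlines():
        match = LINE.search(line)
        if match is None:
            continue
        when = datetime.strptime(match["ts"], "%y/%m/%d %H:%M:%S,%f")
        executor = match["executor"]
        if match["action"] == "Removing":
            removed_at[executor] = when
        elif executor in removed_at:
            delay = (when - removed_at[executor]).total_seconds()
            late.append((executor, delay))
    return late


print(find_late_registrations(DRIVER_LOG))
```

On the excerpt above, the script reports executor 3056, whose block manager was registered again 0.89 seconds after the driver removed it.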
      

            People

              Assignee: Unassigned
              Reporter: Wan Kun (wankun)