[SPARK-37355] Avoid Block Manager registrations when Executor is shutting down

Details

Type: Bug
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

Note: As in ~~SPARK-34949~~ and ~~SPARK-35011~~, the BlockManager re-registers itself while the executor is shutting down.

Problem:

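As described in ~~SPARK-35011~~, HeartbeatReceiver.expireDeadHosts() does not clean up the BlockManager of an executor that is killed with the reason "Executor heartbeat timed out". An executor's heartbeats can time out because of a network issue or for some other reason, such as ~~SPARK-20977~~. In the logs below, the driver removes the block manager for executor 3056 at 05:06:22,072, and the same block manager is registered again at 05:06:22,962.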
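
The idea in the issue title, skipping block manager registration once the executor is shutting down, can be illustrated with a small model. This is a minimal Python sketch, not Spark code: `ToyDriver`, `ToyExecutor`, and all of their methods are hypothetical names invented for illustration, and the change in the linked pull request may be structured differently.

```python
import threading


class ToyDriver:
    """Hypothetical stand-in for the driver's block manager bookkeeping."""

    def __init__(self):
        self.block_managers = set()

    def register(self, executor_id):
        self.block_managers.add(executor_id)

    def remove(self, executor_id):
        self.block_managers.discard(executor_id)


class ToyExecutor:
    """Hypothetical executor that refuses to re-register while shutting down."""

    def __init__(self, executor_id, driver):
        self.executor_id = executor_id
        self.driver = driver
        self._shutting_down = threading.Event()

    def on_shutdown_command(self):
        # Corresponds to "Driver commanded a shutdown" in the executor log.
        self._shutting_down.set()

    def reregister_block_manager(self):
        # Guard: skip registration once shutdown has begun.
        if self._shutting_down.is_set():
            return False
        self.driver.register(self.executor_id)
        return True


driver = ToyDriver()
executor = ToyExecutor("3056", driver)
executor.reregister_block_manager()             # normal registration while running
executor.on_shutdown_command()                  # driver asks the executor to stop
driver.remove("3056")                           # driver drops the block manager
accepted = executor.reregister_block_manager()  # late attempt is skipped
print(accepted, driver.block_managers)
```

Without the guard in `reregister_block_manager`, the last call would put executor 3056 back into `driver.block_managers` after the driver had already removed it, which is the behavior this issue reports.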

Logs:

// Driver Logs

21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36] spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent heartbeats: 350149 ms exceeds timeout 300000 ms
21/11/13 05:06:20,999 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
21/11/13 05:06:21,000 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be killed is 3056
21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8] yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s) 3056.
21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler] cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor 3056 with reason Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler] cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID 245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID 245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from BlockManagerMaster.
21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Removing block manager BlockManagerId(3056, executor_host, 30504, None)
21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Registering block manager executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 30504, None)


// Executor Logs

21/11/13 05:06:21,004 INFO [dispatcher-Executor] executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
21/11/13 05:06:22,215 INFO [block-manager-future-0] storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056, executor_host, 30504, None)
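
To make the ordering easier to see, the key driver and executor events above can be merged into a single timeline. The Python sketch below copies the timestamps from the log excerpts; the event descriptions are shortened paraphrases, and the comparison assumes the driver and executor host clocks are reasonably well synchronized, which the logs do not confirm.

```python
from datetime import datetime, timedelta

# (source, timestamp, paraphrased event) taken from the log excerpts above.
EVENTS = [
    ("driver", "21/11/13 05:06:20,999", "HeartbeatReceiver removes executor 3056"),
    ("driver", "21/11/13 05:06:22,072", "BlockManagerMasterEndpoint removes block manager 3056"),
    ("driver", "21/11/13 05:06:22,962", "BlockManagerMasterEndpoint registers block manager 3056"),
    ("executor", "21/11/13 05:06:21,004", "Driver commanded a shutdown"),
    ("executor", "21/11/13 05:06:22,215", "BlockManagerMaster registering BlockManager 3056"),
]


def parse(ts):
    # Log timestamps have the form yy/MM/dd HH:mm:ss,SSS.
    return datetime.strptime(ts, "%y/%m/%d %H:%M:%S,%f")


# Interleave both logs by timestamp.
timeline = sorted(EVENTS, key=lambda e: parse(e[1]))
for source, ts, event in timeline:
    print(f"{ts}  {source:<8}  {event}")

# Delay between the shutdown command and the executor's registration attempt.
register_after_shutdown = parse("21/11/13 05:06:22,215") - parse("21/11/13 05:06:21,004")
# Delay between the driver removing the block manager and registering it again.
reappears_after_removal = parse("21/11/13 05:06:22,962") - parse("21/11/13 05:06:22,072")
print(register_after_shutdown.total_seconds(), reappears_after_removal.total_seconds())
```

Under that clock assumption, the executor starts registering its BlockManager after it has already been told to shut down, and the driver accepts the registration after it has removed the same block manager.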

Issue Links

is duplicated by: SPARK-41360 Avoid BlockManager re-registration if the executor has been lost (Resolved)

links to: [GitHub] Pull Request #34629 (wankunde)

People

Assignee: Unassigned
Reporter: Wan Kun
Votes: 0
Watchers: 2

Dates

Created: 17/Nov/21 08:52
Updated: 02/Dec/22 08:19