Note: This is a follow-up to SPARK-34949; even after the heartbeat fix, the driver reports dead executors as alive.
I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down after "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s; however, under the "Executors" tab in the Spark UI, some executors were still listed as alive.
spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1. Here is the sequence of events that leads to this state:
- "CoarseGrainedSchedulerBackend" issues an asynchronous "StopExecutor" to the "executorEndpoint".
- "CoarseGrainedSchedulerBackend" removes that executor from the Driver's internal data structures and publishes "SparkListenerExecutorRemoved" on the "listenerBus".
- The Executor has not yet processed the "StopExecutor" message from the Driver.
- The Driver receives a heartbeat from the Executor; since it cannot find the "executorId" in its data structures, it responds with "HeartbeatResponse(reregisterBlockManager = true)".
- The "BlockManager" on the Executor re-registers with the "BlockManagerMaster", and "SparkListenerBlockManagerAdded" is published on the "listenerBus".
- The Executor starts processing the "StopExecutor" message and exits.
- "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and updates the "AppStatusStore".
- "statusTracker.getExecutorInfos" refers to the "AppStatusStore" for the list of executors, which now returns the dead executor as alive.
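The race above can be sketched with a minimal, self-contained simulation. This is plain Java with hypothetical stand-in names, not the actual Spark internals; it only models the ordering of events:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal model of the race: the driver drops the executor first, then a
// late heartbeat triggers a BlockManager re-registration, so the status
// store ends up reporting the dead executor as alive.
public class ExecutorRemovalRace {
    // Driver-side view: executors the scheduler still tracks.
    static Map<String, Boolean> schedulerExecutors = new HashMap<>();
    // What the status store would report as alive (fed by listener events).
    static Set<String> aliveInStatusStore = new HashSet<>();

    // Heartbeat handling: an unknown executorId means the driver asks the
    // executor to re-register its BlockManager.
    static boolean handleHeartbeat(String executorId) {
        return !schedulerExecutors.containsKey(executorId);
    }

    public static void main(String[] args) {
        String exec = "1";
        schedulerExecutors.put(exec, true);
        aliveInStatusStore.add(exec);

        // Driver issues StopExecutor (async) and removes its state,
        // publishing the "executor removed" event.
        schedulerExecutors.remove(exec);
        aliveInStatusStore.remove(exec);

        // The executor has not processed StopExecutor yet and heartbeats.
        if (handleHeartbeat(exec)) {
            // Re-registration publishes a "block manager added" event,
            // which marks the executor alive in the status store again.
            aliveInStatusStore.add(exec);
        }

        // The executor then exits, but the status store still lists it.
        System.out.println("alive executors: " + aliveInStatusStore);
    }
}
```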
The proposed fix is to maintain a cache of recently removed executors on the Driver. During registration in "BlockManagerMasterEndpoint", if the "BlockManager" belongs to a recently removed executor, return None to indicate the registration is ignored, since the executor will be shutting down soon.
On "BlockManagerHeartbeat", if the "BlockManager" belongs to a recently removed executor, return true to indicate the Driver knows about it, thereby preventing re-registration.
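A minimal sketch of that idea, assuming a bounded LRU cache of removed executor IDs; the class and method names are illustrative, not the actual Spark API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the proposed fix: keep a small LRU cache of recently removed
// executor IDs and consult it on BlockManager registration and heartbeat.
public class RecentlyRemovedExecutors {
    private final Map<String, Boolean> cache;

    public RecentlyRemovedExecutors(int maxSize) {
        // Access-ordered LinkedHashMap with removeEldestEntry gives a
        // simple bounded LRU cache.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > maxSize;
            }
        };
    }

    // Called when the scheduler removes an executor.
    public void executorRemoved(String executorId) {
        cache.put(executorId, Boolean.TRUE);
    }

    // Registration path: return empty (i.e. "None") for a recently removed
    // executor so its BlockManager is never re-added to the master.
    public Optional<String> register(String executorId) {
        if (cache.containsKey(executorId)) {
            return Optional.empty(); // ignored; executor is shutting down
        }
        return Optional.of(executorId); // normal registration path
    }

    // Heartbeat path: report "known" for a recently removed executor so it
    // does not attempt to re-register its BlockManager.
    public boolean heartbeat(String executorId) {
        return cache.containsKey(executorId);
    }
}
```

With this in place, the stale heartbeat in the sequence above gets a "known" response instead of "reregisterBlockManager = true", so no "SparkListenerBlockManagerAdded" event is published for the dying executor.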