Description
Problem:
I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s; however, the "Executors" tab in the Spark UI still listed some executors as alive.
spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1.
Cause:
- "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
- "CoarseGrainedExecutorBackend" starts the executor shutdown
- "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
- In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
- The Executor now calls "env.blockManager.reregister()" and re-registers itself, leaving the driver with state for an executor that no longer exists
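
The sequence above can be reproduced in a small standalone model. This is only a sketch with no Spark dependency: "DriverModel", "registeredBlockManagers", "LateHeartbeatDemo" and the method names are hypothetical stand-ins, and "registeredBlockManagers" is assumed to represent whatever driver-side state the re-registration repopulates.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the driver's reply; not Spark's real class.
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

// Hypothetical model of the driver-side bookkeeping described in the list above.
final class DriverModel {
  // Plays the role of HeartbeatReceiver's "executorLastSeen".
  val executorLastSeen = mutable.Map.empty[String, Long]
  // Assumption: stands in for the state that makes an executor look alive.
  val registeredBlockManagers = mutable.Set.empty[String]

  def registerExecutor(id: String, now: Long): Unit = {
    executorLastSeen(id) = now
    registeredBlockManagers += id
  }

  // Models the reaction to "SparkListenerExecutorRemoved".
  def onExecutorRemoved(id: String): Unit = {
    executorLastSeen -= id
    registeredBlockManagers -= id
  }

  // An executor missing from executorLastSeen is told to re-register.
  def receiveHeartbeat(id: String, now: Long): HeartbeatResponse =
    if (executorLastSeen.contains(id)) {
      executorLastSeen(id) = now
      HeartbeatResponse(reregisterBlockManager = false)
    } else {
      HeartbeatResponse(reregisterBlockManager = true)
    }
}

object LateHeartbeatDemo {
  // Returns the executors the driver still tracks after the race has played out.
  def run(): Set[String] = {
    val driver = new DriverModel
    driver.registerExecutor("1", now = 0L)
    // The executor is removed and the removal event is processed.
    driver.onExecutorRemoved("1")
    // A heartbeat sent before shutdown completed arrives afterwards.
    val response = driver.receiveHeartbeat("1", now = 1L)
    // The executor obeys the response without checking that it is shutting down.
    if (response.reregisterBlockManager) driver.registeredBlockManagers += "1"
    driver.registeredBlockManagers.toSet
  }
}
```

In this model, "LateHeartbeatDemo.run()" still contains executor "1" even though it was removed, which mirrors the stale entries seen in the UI.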
Proposed Solution:
The "reportHeartBeat" method is not aware that the Executor is shutting down; it should check "executorShutdown" before reregistering.
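
A minimal sketch of the proposed guard, not the actual patch. "ExecutorModel", "reregisterCalls" and "handleHeartbeatResponse" are hypothetical names; only "executorShutdown" comes from the description above, and modelling it as an AtomicBoolean is an assumption.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical model of the executor side; not Spark's real Executor class.
final class ExecutorModel {
  // Assumption: the shutdown flag is an atomic boolean set when stop() begins.
  val executorShutdown = new AtomicBoolean(false)
  // Counts how often a re-registration would have been triggered.
  var reregisterCalls = 0

  def stop(): Unit = executorShutdown.set(true)

  // Re-register only when the driver asks for it AND the executor is not shutting down.
  def handleHeartbeatResponse(reregisterBlockManager: Boolean): Unit =
    if (reregisterBlockManager && !executorShutdown.get()) {
      reregisterCalls += 1 // stands in for env.blockManager.reregister()
    }
}
```

With this check, a re-register request that arrives after "stop()" has been called is ignored, so a late heartbeat response can no longer bring a removed executor back.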