[SPARK-34949] Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.1, 3.2.0
Fix Version/s: 3.1.2, 3.2.0, 3.0.4
Component/s: Spark Core
Labels:
- Executor
- heartbeat
Environment:

Resource Manager: K8s

Description

Problem:

I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s. However, the "Executors" tab in the Spark UI still listed some executors as alive.

"spark.sparkContext.statusTracker.getExecutorInfos.length" also returned a value greater than 1, even though only the driver should have remained.

Cause:

1. "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus".
2. "CoarseGrainedExecutorBackend" starts the executor shutdown.
3. "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen".
4. In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" can no longer find the "executorId" in "executorLastSeen" and therefore responds with "HeartbeatResponse(reregisterBlockManager = true)".
5. The executor then calls "env.blockManager.reregister()" and reregisters itself, which leaves the driver's view of live executors inconsistent with the actual cluster state.
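
The sequence above can be reproduced with a small toy model. This is only an illustrative sketch in Python, not Spark code: the "Driver" and "Executor" classes, their methods, and the "simulate_race" helper are hypothetical stand-ins, and the driver's bookkeeping is reduced to a dictionary and a set.

```python
# Toy model of the race described above. This is not Spark code: the class
# and method names are hypothetical stand-ins for the real components.

class Driver:
    def __init__(self):
        self.executor_last_seen = {}   # stands in for HeartbeatReceiver.executorLastSeen
        self.block_managers = set()    # block managers the driver believes are alive

    def register_executor(self, executor_id, now):
        self.executor_last_seen[executor_id] = now
        self.block_managers.add(executor_id)

    def remove_executor(self, executor_id):
        # Models RemoveExecutor plus the SparkListenerExecutorRemoved handling:
        # the executor disappears from both structures.
        self.executor_last_seen.pop(executor_id, None)
        self.block_managers.discard(executor_id)

    def receive_heartbeat(self, executor_id, now):
        # Returns the value of the reregisterBlockManager flag.
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return False
        return True   # unknown executor: ask it to reregister


class Executor:
    def __init__(self, executor_id, driver):
        self.executor_id = executor_id
        self.driver = driver

    def report_heartbeat(self, now):
        # The unguarded executor obeys the flag unconditionally.
        if self.driver.receive_heartbeat(self.executor_id, now):
            self.driver.block_managers.add(self.executor_id)


def simulate_race():
    driver = Driver()
    executor = Executor("exec-1", driver)
    driver.register_executor("exec-1", now=0)
    driver.remove_executor("exec-1")    # driver forgets the executor
    executor.report_heartbeat(now=1)    # late heartbeat from the dying executor
    return driver.block_managers
```

In this model, "simulate_race()" returns a set that still contains "exec-1": the removed executor is back in the driver's bookkeeping, which matches the stale entries seen in the UI.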

Proposed Solution:

The "reportHeartBeat" method is not aware that the Executor is shutting down. It should check "executorShutdown" before reregistering the block manager.
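
A minimal sketch of the proposed guard, again as a Python toy model rather than the actual Spark change: "GuardedExecutor", "stop", and the "registered" set are hypothetical, and the boolean argument stands in for the "reregisterBlockManager" field of "HeartbeatResponse".

```python
# Toy model of the proposed guard. Not Spark code: all names are hypothetical.

class GuardedExecutor:
    def __init__(self, executor_id, registered):
        self.executor_id = executor_id
        self.registered = registered      # driver-side set of live block managers
        self.executor_shutdown = False    # stands in for the executorShutdown flag

    def stop(self):
        # Shutdown has started; heartbeats may still be in flight.
        self.executor_shutdown = True

    def report_heartbeat(self, reregister_block_manager):
        # Reregister only if the driver asked for it AND we are not shutting down.
        if reregister_block_manager and not self.executor_shutdown:
            self.registered.add(self.executor_id)
            return True
        return False
```

With this check, an executor that has already called "stop()" ignores a reregister request, while a running executor still reregisters as before.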

Issue Links

links to

[GitHub] Pull Request #32043 (sumeetgajjar)

[GitHub] Pull Request #33770 (sumeetgajjar)

People

Assignee: Sumeet

Reporter: Sumeet

Votes: 0

Watchers: 4

Dates

Created: 03/Apr/21 05:49

Updated: 18/Aug/21 23:24

Resolved: 05/Apr/21 22:34