Spark / SPARK-34949

Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.1.2, 3.2.0, 3.0.4
    • Component/s: Spark Core
    • Environment:

      Resource Manager: K8s

      Description

      Problem:

      I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s. However, the "Executors" tab in the Spark UI still listed some executors as alive.

      spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1 (the result includes the driver, so 1 would be expected once every executor is gone).

       

      Cause:

      • "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
      • "CoarseGrainedExecutorBackend" starts the executor shutdown
      • "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
      • In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and therefore responds with "HeartbeatResponse(reregisterBlockManager = true)"
      • The executor then calls "env.blockManager.reregister()" and re-registers itself, so the driver again tracks an executor that is being torn down
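
The sequence above can be replayed with a toy model. This is only a sketch in Python, not Spark code: the `Driver` class, its fields, and `replay_race` are hypothetical names, and the driver-side heartbeat bookkeeping and block manager registry are merged into one object for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatResponse:
    reregister_block_manager: bool


@dataclass
class Driver:
    """Hypothetical stand-in for the driver-side state described above."""
    executor_last_seen: dict = field(default_factory=dict)
    registered_block_managers: set = field(default_factory=set)

    def register_executor(self, executor_id: str) -> None:
        self.executor_last_seen[executor_id] = 0
        self.registered_block_managers.add(executor_id)

    def remove_executor(self, executor_id: str) -> None:
        # Models the effect of the executor-removed event on the driver.
        self.executor_last_seen.pop(executor_id, None)
        self.registered_block_managers.discard(executor_id)

    def receive_heartbeat(self, executor_id: str, now: int) -> HeartbeatResponse:
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return HeartbeatResponse(reregister_block_manager=False)
        # Unknown executor: ask it to re-register its block manager.
        return HeartbeatResponse(reregister_block_manager=True)


def replay_race() -> set:
    driver = Driver()
    driver.register_executor("1")
    driver.remove_executor("1")  # executor removed while it is shutting down
    response = driver.receive_heartbeat("1", now=10)  # late heartbeat arrives
    if response.reregister_block_manager:
        # The executor obeys and re-registers, although it is going away.
        driver.registered_block_managers.add("1")
    return driver.registered_block_managers


print(replay_race())  # the removed executor is registered again
```

In this model the removed executor ends up back in the registry, which mirrors the stale entries seen in the "Executors" tab.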

       

      Proposed Solution:

      The "reportHeartBeat" method is not aware that the Executor is shutting down. It should check "executorShutdown" before reregistering.
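
A minimal sketch of such a guard, written as a self-contained Python model and not as the actual patch: the `Executor` class, `handle_heartbeat_response`, and the `reregister_calls` counter are hypothetical, and the counter merely stands in for the call to "env.blockManager.reregister()".

```python
import threading


class Executor:
    """Hypothetical executor model; only the heartbeat response path is shown."""

    def __init__(self) -> None:
        # Plays the role of the "executorShutdown" flag named in the proposal.
        self.executor_shutdown = threading.Event()
        self.reregister_calls = 0

    def stop(self) -> None:
        self.executor_shutdown.set()

    def handle_heartbeat_response(self, reregister_block_manager: bool) -> None:
        # Guard: ignore the re-register request once shutdown has begun.
        if reregister_block_manager and not self.executor_shutdown.is_set():
            self.reregister_calls += 1  # stands in for the real re-registration


running = Executor()
running.handle_heartbeat_response(True)

stopping = Executor()
stopping.stop()
stopping.handle_heartbeat_response(True)

print(running.reregister_calls, stopping.reregister_calls)  # prints "1 0"
```

A running executor still honours the request, while one that has begun shutting down does not, so the late heartbeat can no longer resurrect it on the driver.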

            People

            • Assignee: Sumeet (sumeet.gajjar)
            • Reporter: Sumeet (sumeet.gajjar)