  1. Spark
  2. SPARK-34949

Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.1.2, 3.2.0, 3.0.4
    • Component/s: Spark Core
    • Resource Manager: K8s

    Description

      Problem:

      I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s, but the "Executors" tab in the Spark UI still listed some executors as alive.

      spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1, although only the driver should have remained.

       

      Cause:

      • "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
      • "CoarseGrainedExecutorBackend" starts the executor shutdown
      • "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
      • In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
      • The Executor now calls "env.blockManager.reregister()" and reregisters itself, which creates the inconsistency
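
      The sequence above can be reproduced with a small stand-alone model. This is only a sketch: "HeartbeatReceiverModel", "ExecutorModel" and "HeartbeatRaceDemo" are hypothetical names, the "registered" set stands in for the driver's view of live block managers, and none of this is the actual Spark code.

```scala
import scala.collection.mutable

// Simplified stand-ins for the Spark classes named above; not the real implementations.
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

// Driver side: remembers when each known executor last sent a heartbeat.
class HeartbeatReceiverModel {
  val executorLastSeen = mutable.HashMap.empty[String, Long]

  // Models the handling of SparkListenerExecutorRemoved.
  def onExecutorRemoved(executorId: String): Unit =
    executorLastSeen.remove(executorId)

  // A heartbeat from an unknown executor is answered with a reregister request.
  def heartbeat(executorId: String, now: Long): HeartbeatResponse =
    if (executorLastSeen.contains(executorId)) {
      executorLastSeen(executorId) = now
      HeartbeatResponse(reregisterBlockManager = false)
    } else {
      HeartbeatResponse(reregisterBlockManager = true)
    }
}

// Executor side: reregisters whenever asked, with no shutdown check.
class ExecutorModel(val id: String, registered: mutable.Set[String]) {
  def reportHeartBeat(receiver: HeartbeatReceiverModel, now: Long): Unit =
    if (receiver.heartbeat(id, now).reregisterBlockManager) {
      registered += id // stands in for env.blockManager.reregister()
    }
}

object HeartbeatRaceDemo {
  // Returns the executors the driver still lists after the removal.
  def run(): Set[String] = {
    val receiver = new HeartbeatReceiverModel
    val registered = mutable.Set("exec-1")
    receiver.executorLastSeen("exec-1") = 0L
    val executor = new ExecutorModel("exec-1", registered)

    // The driver removes the executor and the receiver forgets it.
    registered -= "exec-1"
    receiver.onExecutorRemoved("exec-1")

    // A heartbeat that was already in flight arrives afterwards.
    executor.reportHeartBeat(receiver, now = 1L)
    registered.toSet
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

      In this model the removed executor ends up back in the registered set, which mirrors the stale entries seen in the "Executors" tab.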

       

      Proposed Solution:

      The "reportHeartBeat" method is not aware that the Executor is shutting down. It should check "executorShutdown" before reregistering.
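
      A minimal sketch of the proposed check, assuming "executorShutdown" is an atomic boolean flag that is set when shutdown begins. "GuardedExecutorModel", "handleHeartbeatResponse", "reregisterCalls" and "GuardedHeartbeatDemo" are hypothetical names used only for illustration; this is not the merged patch.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Simplified stand-in for the heartbeat response described in the issue.
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

class GuardedExecutorModel {
  // Assumption: a flag flipped once the executor starts shutting down.
  val executorShutdown = new AtomicBoolean(false)

  // Counts how often the block manager would have been reregistered.
  var reregisterCalls = 0

  def stop(): Unit = executorShutdown.set(true)

  // Reregister only if the driver asked for it AND the executor is still running.
  def handleHeartbeatResponse(response: HeartbeatResponse): Unit =
    if (!executorShutdown.get() && response.reregisterBlockManager) {
      reregisterCalls += 1 // stands in for env.blockManager.reregister()
    }
}

object GuardedHeartbeatDemo {
  // Returns the reregister counts before and after shutdown.
  def run(): (Int, Int) = {
    val executor = new GuardedExecutorModel
    val askToReregister = HeartbeatResponse(reregisterBlockManager = true)

    executor.handleHeartbeatResponse(askToReregister)
    val beforeShutdown = executor.reregisterCalls

    executor.stop()
    executor.handleHeartbeatResponse(askToReregister)
    (beforeShutdown, executor.reregisterCalls)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

      The count does not grow after "stop()", so a late reregister request from the driver is ignored once shutdown has started.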

          People

            Assignee: sumeet.gajjar Sumeet
            Reporter: sumeet.gajjar Sumeet
            Votes: 0
            Watchers: 4
