Spark / SPARK-35011

Avoid Block Manager registrations when StopExecutor msg is in-flight.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: Spark Core

    Description

      Note: This is a follow-up to SPARK-34949. Even after the heartbeat fix, the driver still reports dead executors as alive.

      Problem:

      I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s. However, the "Executors" tab in the Spark UI still listed some executors as alive.

      spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1, although the driver should have been the only entry left.

       

      Cause:

      • "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the executorEndpoint.
      • "CoarseGrainedSchedulerBackend" removes that executor from the driver's internal data structures and publishes "SparkListenerExecutorRemoved" on the "listenerBus".
      • The executor has not yet processed "StopExecutor" from the driver.
      • The driver receives a heartbeat from the executor. Because it can no longer find the "executorId" in its data structures, it responds with "HeartbeatResponse(reregisterBlockManager = true)".
      • The "BlockManager" on the executor re-registers with the "BlockManagerMaster", and "SparkListenerBlockManagerAdded" is published on the "listenerBus".
      • The executor starts processing "StopExecutor" and exits.
      • "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and updates "AppStatusStore".
      • "statusTracker.getExecutorInfos" reads the list of executors from "AppStatusStore", which now reports the dead executor as alive.
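
      The sequence above can be replayed in a small, self-contained toy model. This is only a sketch for illustration: "ToyDriver" and all of its members are hypothetical stand-ins and do not correspond to Spark's real classes or method signatures.

```scala
import scala.collection.mutable

// Toy model of the driver-side state involved in the race.
// Every name here is a hypothetical stand-in, not Spark's actual API.
object StaleExecutorRace {
  final class ToyDriver {
    // Executors the driver currently tracks in its internal data structures.
    val knownExecutors = mutable.Set.empty[String]
    // What the UI / status tracker would list as alive.
    val statusStore = mutable.Set.empty[String]

    // Models block manager registration plus the "block manager added" event.
    def registerBlockManager(id: String): Unit = {
      knownExecutors += id
      statusStore += id
    }

    // Models dropping driver state plus the "executor removed" event.
    def removeExecutor(id: String): Unit = {
      knownExecutors -= id
      statusStore -= id
    }

    // Returns true when the executor is told to re-register its block manager.
    def heartbeat(id: String): Boolean = !knownExecutors.contains(id)
  }

  def run(): Set[String] = {
    val driver = new ToyDriver
    driver.registerBlockManager("1")
    // StopExecutor is sent asynchronously and driver state is dropped right away.
    driver.removeExecutor("1")
    // The executor has not processed StopExecutor yet, so it still heartbeats.
    if (driver.heartbeat("1")) {
      // The driver no longer knows the executor, so the block manager re-registers.
      driver.registerBlockManager("1")
    }
    // The executor now exits, but nothing removes it from the status store again.
    driver.statusStore.toSet
  }
}
```

      In this model "StaleExecutorRace.run()" returns a set that still contains executor "1", which mirrors the dead executor showing up as alive.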

       

      Proposed Solution:

      Maintain a cache of recently removed executors on the driver. During registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a recently removed executor, return None to indicate that the registration is ignored because the executor will be shutting down soon.

      On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed executor, return true to indicate that the driver knows about it, thereby preventing re-registration.
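
      A minimal sketch of the proposed behavior, written against a toy model rather than Spark itself. "RecentlyRemoved", "ToyBlockManagerMaster", the "ttlMillis" parameter, and the injectable clock are all assumptions made for this example. The report does not say how long an executor should count as recently removed or how the cache is implemented.

```scala
import scala.collection.mutable

object RecentlyRemovedSketch {
  // Time-bounded record of executors the driver has just removed.
  // `ttlMillis` and the injectable clock `now` are assumptions of this sketch.
  final class RecentlyRemoved(ttlMillis: Long, now: () => Long) {
    private val removedAt = mutable.Map.empty[String, Long]

    def add(id: String): Unit = {
      removedAt.update(id, now())
    }

    def contains(id: String): Boolean = removedAt.get(id) match {
      case Some(t) if now() - t < ttlMillis => true
      case Some(_) =>
        removedAt -= id // entry expired, forget it
        false
      case None => false
    }
  }

  // Hypothetical stand-in for the driver-side block manager bookkeeping.
  final class ToyBlockManagerMaster(recentlyRemoved: RecentlyRemoved) {
    private val registered = mutable.Set.empty[String]

    def removeExecutor(id: String): Unit = {
      registered -= id
      recentlyRemoved.add(id)
    }

    // None means the registration was ignored; Some(id) means it was accepted.
    def register(id: String): Option[String] =
      if (recentlyRemoved.contains(id)) None
      else { registered += id; Some(id) }

    // true means "the driver knows this block manager, do not re-register".
    def heartbeat(id: String): Boolean =
      registered.contains(id) || recentlyRemoved.contains(id)

    def isRegistered(id: String): Boolean = registered.contains(id)
  }

  def demo(): (Boolean, Option[String], Boolean, Option[String]) = {
    var clock = 0L
    val cache = new RecentlyRemoved(ttlMillis = 1000L, now = () => clock)
    val master = new ToyBlockManagerMaster(cache)
    master.register("1")
    master.removeExecutor("1")
    clock = 100L
    val heartbeatKnown = master.heartbeat("1")  // still within the TTL
    val lateRegistration = master.register("1") // ignored within the TTL
    val notRegistered = !master.isRegistered("1")
    clock = 5000L
    val afterTtl = master.register("1")         // TTL elapsed, accepted again
    (heartbeatKnown, lateRegistration, notRegistered, afterTtl)
  }
}
```

      Here "demo()" returns (true, None, true, Some("1")): while the entry is fresh, the heartbeat does not trigger re-registration and a late registration is ignored, and once the entry expires the same id can register normally. Passing the clock in as a function keeps the sketch deterministic and easy to test.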

          People

            Assignee: wuyi (Ngone51)
            Reporter: Sumeet (sumeet.gajjar)