SPARK-34949: Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.1.2, 3.2.0, 3.0.4
    • Component/s: Spark Core
    • Resource Manager: K8s

    Description

      Problem:

      I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s. However, under the "Executors" tab in the Spark UI, some executors were still listed as alive.

      spark.sparkContext.statusTracker.getExecutorInfos.length also returned a value greater than 1, although only the driver's own entry should have remained.

       

      Cause:

      • "CoarseGrainedSchedulerBackend" issues "RemoveExecutor" on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
      • "CoarseGrainedExecutorBackend" starts the executor shutdown
      • "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
      • In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" can no longer find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
      • The executor now calls "env.blockManager.reregister()" and reregisters itself, so the driver again lists an executor whose pod has already been removed
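
      The sequence above can be reproduced with a toy model of the driver-side bookkeeping. This is only a sketch: "ToyHeartbeatReceiver", its method names, and "LateHeartbeatDemo" are hypothetical stand-ins, and the "HeartbeatResponse" case class here is a local copy for illustration, not Spark's own class.

```scala
import scala.collection.mutable

// Local stand-in for the response described in the report (assumption).
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

// Hypothetical, simplified model of the driver-side heartbeat bookkeeping.
class ToyHeartbeatReceiver {
  // executorId -> timestamp of the last heartbeat seen
  private val executorLastSeen = mutable.Map.empty[String, Long]

  def onExecutorAdded(id: String): Unit =
    executorLastSeen(id) = System.currentTimeMillis()

  // Models the handling of SparkListenerExecutorRemoved.
  def onExecutorRemoved(id: String): Unit =
    executorLastSeen.remove(id)

  // A heartbeat from an unknown executor is answered with a re-register request.
  def onHeartbeat(id: String): HeartbeatResponse =
    if (executorLastSeen.contains(id)) {
      executorLastSeen(id) = System.currentTimeMillis()
      HeartbeatResponse(reregisterBlockManager = false)
    } else {
      HeartbeatResponse(reregisterBlockManager = true)
    }
}

object LateHeartbeatDemo {
  // Returns the responses to a heartbeat sent before and after removal.
  def run(): (HeartbeatResponse, HeartbeatResponse) = {
    val receiver = new ToyHeartbeatReceiver
    receiver.onExecutorAdded("1")
    val beforeRemoval = receiver.onHeartbeat("1")
    receiver.onExecutorRemoved("1")               // driver has removed the executor
    val afterRemoval = receiver.onHeartbeat("1")  // executor is still shutting down
    (beforeRemoval, afterRemoval)
  }
}
```

      In this model the second heartbeat is answered with "reregisterBlockManager = true" purely because the entry is gone, which is the window the report describes.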

       

      Proposed Solution:

      The "reportHeartBeat" method is not aware that the Executor is shutting down. It should check "executorShutdown" before reregistering.
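
      A minimal sketch of such a guard, under stated assumptions: "ToyExecutor", "handleHeartbeatResponse", and "reregisterCount" are hypothetical names, "executorShutdown" is assumed to be an atomic boolean flag set when the executor stops, and the counter stands in for the call to "env.blockManager.reregister()". It is not the actual patch.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Local stand-in for the response described in the report (assumption).
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

// Hypothetical, simplified executor that only models the re-register decision.
class ToyExecutor {
  // Assumed to be flipped to true once shutdown has started.
  private val executorShutdown = new AtomicBoolean(false)

  // Counts how often re-registration would have happened.
  var reregisterCount: Int = 0

  def stop(): Unit = executorShutdown.set(true)

  def handleHeartbeatResponse(response: HeartbeatResponse): Unit =
    // Re-register only if the driver asked for it AND we are not shutting down.
    if (response.reregisterBlockManager && !executorShutdown.get()) {
      reregisterCount += 1 // stands in for env.blockManager.reregister()
    }
}
```

      With this check, a re-register request that arrives after "stop()" has been called is ignored, so a late heartbeat cannot bring the executor back into the driver's view.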

          People

            sumeet.gajjar Sumeet