  1. Spark
  2. SPARK-15262

race condition in killing an executor and reregistering an executor


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 1.6.2, 2.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      There is a race condition when killing an executor and reregistering that executor happen at the same time. Here are the execution steps to reproduce it.

      1. The master finds that a worker is dead.
      2. The master tells the driver to remove the executor.
      3. The driver removes the executor.
      4. BlockManagerMasterEndpoint removes the block manager.
      5. The executor finds via heartbeat that it is not registered.
      6. The executor sends a request to reregister its block manager.
      7. The block manager is registered again.
      8. The executor is killed by the worker.
      9. CoarseGrainedSchedulerBackend ignores onDisconnected because this address is not in the executor list.
      10. BlockManagerMasterEndpoint.blockManagerInfo contains dead block managers.
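
      The interleaving above can be replayed with a small toy model. This is only a sketch in Python, not Spark code: ToyCluster and all of its methods are hypothetical stand-ins for the driver-side state (the scheduler backend's executor list and BlockManagerMasterEndpoint.blockManagerInfo), and the step numbers in the comments refer to the list above.

```python
class ToyCluster:
    """Toy model of the driver-side state involved in the race (hypothetical names)."""

    def __init__(self):
        self.executors = set()         # stands in for the scheduler backend's executor list
        self.block_manager_info = {}   # stands in for blockManagerInfo
        self.alive = set()             # executors whose process is really running

    def start_executor(self, executor_id):
        self.alive.add(executor_id)
        self.executors.add(executor_id)
        self.block_manager_info[executor_id] = "endpoint-" + executor_id

    def remove_executor(self, executor_id):
        # Steps 2-4: the driver forgets the executor and its block manager.
        self.executors.discard(executor_id)
        self.block_manager_info.pop(executor_id, None)

    def heartbeat_needs_reregister(self, executor_id):
        # Step 5: the heartbeat reveals that the block manager is unknown.
        return executor_id not in self.block_manager_info

    def register_block_manager(self, executor_id):
        # Steps 6-7: the block manager is registered again.
        self.block_manager_info[executor_id] = "endpoint-" + executor_id

    def on_disconnected(self, executor_id):
        # Steps 8-9: the executor dies, but the disconnect is ignored
        # because the executor is no longer in the executor list.
        self.alive.discard(executor_id)
        if executor_id in self.executors:
            self.remove_executor(executor_id)


def replay_race():
    cluster = ToyCluster()
    cluster.start_executor("exec-1")
    cluster.remove_executor("exec-1")                  # steps 1-4
    if cluster.heartbeat_needs_reregister("exec-1"):   # step 5
        cluster.register_block_manager("exec-1")       # steps 6-7
    cluster.on_disconnected("exec-1")                  # steps 8-9
    return cluster


cluster = replay_race()
# Step 10: entries that are registered but whose executor is dead.
stale = set(cluster.block_manager_info) - cluster.alive
print(stale)
```

      In this model the final state holds a block manager entry for an executor that is neither alive nor known to the scheduler, which mirrors step 10.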

      Because BlockManagerMasterEndpoint.blockManagerInfo contains some dead block managers, unpersisting an RDD, removing a broadcast, or cleaning a shuffle block via the RPC endpoint of a dead block manager fails with a ClosedChannelException.
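
      The failure mode can be sketched as follows. Again this is a toy Python model rather than Spark code: Endpoint, ClosedChannelError, remove_rdd and on_disconnected are hypothetical names. The on_disconnected function shows one possible mitigation (always dropping the block manager entry on disconnect, even when the scheduler no longer knows the executor); it is an assumption for illustration and is not claimed to be the fix that was merged.

```python
class ClosedChannelError(Exception):
    """Stands in for the ClosedChannelException seen on the driver."""


class Endpoint:
    """Hypothetical RPC endpoint of one block manager."""

    def __init__(self, alive):
        self.alive = alive

    def ask(self, message):
        if not self.alive:
            raise ClosedChannelError("channel closed while sending %r" % (message,))
        return True


def remove_rdd(block_manager_info, rdd_id):
    """Fan the removal out to every registered block manager; return the ids that failed."""
    failed = []
    for executor_id, endpoint in block_manager_info.items():
        try:
            endpoint.ask(("RemoveRdd", rdd_id))
        except ClosedChannelError:
            failed.append(executor_id)
    return failed


def on_disconnected(executors, block_manager_info, executor_id):
    """Assumed mitigation: clean up the block manager even for an unknown executor."""
    executors.discard(executor_id)
    block_manager_info.pop(executor_id, None)


# exec-1 is the dead block manager left behind by the race; exec-2 is healthy.
info = {"exec-1": Endpoint(alive=False), "exec-2": Endpoint(alive=True)}
executors = {"exec-2"}

failed_before = remove_rdd(info, rdd_id=42)
on_disconnected(executors, info, "exec-1")
failed_after = remove_rdd(info, rdd_id=42)
print(failed_before, failed_after)
```

      With the stale entry present, the fan-out fails for exec-1; once the entry is removed, the same call succeeds for every remaining block manager.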

          People

            andrewor14 Andrew Or
            zsxwing Shixiong Zhu
            Votes: 0
            Watchers: 7
