  1. Spark
  2. SPARK-17022

Potential deadlock in driver handling message


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Suppose t1 < t2 < t3.
      At t1, a caller invokes YarnSchedulerBackend.doRequestTotalExecutors from one of three methods: CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors or CoarseGrainedSchedulerBackend.requestExecutors. All three hold the lock on `CoarseGrainedSchedulerBackend` while making the call.
      YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors message to `yarnSchedulerEndpoint` and waits for the reply.

      At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint`, and the endpoint receives it.

      At t3, the RequestExecutors message sent at t1 is received by the endpoint.

      The endpoint therefore handles RemoveExecutor first and RequestExecutors second.

      To handle RemoveExecutor, it forwards the same message to `driverEndpoint` and waits for the reply.

      To handle that message, `driverEndpoint` needs the lock on `CoarseGrainedSchedulerBackend`, which has been held since t1.

      This is a deadlock: the lock holder is waiting for the reply to RequestExecutors, which the endpoint cannot process until RemoveExecutor finishes, and RemoveExecutor cannot finish without the lock.

      We hit this issue in our deployment: the driver was blocked and handled no messages until both messages timed out.
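
      The circular wait described in this report can be sketched outside Spark. The following is a minimal, hypothetical model in plain Java, not Spark code: `backendLock` stands in for the lock on `CoarseGrainedSchedulerBackend`, and two single-threaded executors stand in for `yarnSchedulerEndpoint` and `driverEndpoint`, on the assumption that each endpoint handles one message at a time. The class name, the latches, the reply strings, and both timeout values are invented for illustration. The two timeouts are deliberately unequal so that the sketch always ends the same way; they are not Spark's RPC timeouts.

      ```java
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      class DeadlockSketch {
          /** Returns {outcome of the RequestExecutors ask, outcome of the RemoveExecutor ask}. */
          static String[] run() {
              // Hypothetical stand-in for the lock on CoarseGrainedSchedulerBackend.
              final Object backendLock = new Object();
              // Each single-threaded executor models an endpoint handling one message at a time.
              ExecutorService yarnSchedulerEndpoint = Executors.newSingleThreadExecutor();
              ExecutorService driverEndpoint = Executors.newSingleThreadExecutor();
              CountDownLatch lockHeld = new CountDownLatch(1);
              CountDownLatch removeQueued = new CountDownLatch(1);
              String[] outcome = new String[2];
              try {
                  // t1: the caller takes the lock and waits for a reply while still holding it.
                  Thread caller = new Thread(() -> {
                      synchronized (backendLock) {
                          lockHeld.countDown();
                          try {
                              removeQueued.await(); // make sure RemoveExecutor is queued first
                              // t3: RequestExecutors reaches the endpoint behind RemoveExecutor.
                              Future<String> reply = yarnSchedulerEndpoint.submit(() -> "OK");
                              outcome[0] = reply.get(300, TimeUnit.MILLISECONDS);
                          } catch (TimeoutException e) {
                              outcome[0] = "TIMEOUT";
                          } catch (Exception e) {
                              outcome[0] = "ERROR";
                          }
                      }
                  });
                  caller.start();
                  lockHeld.await();

                  // t2: RemoveExecutor arrives; its handler forwards to driverEndpoint and waits.
                  Future<String> remove = yarnSchedulerEndpoint.submit(() -> {
                      Future<String> fromDriver = driverEndpoint.submit(() -> {
                          synchronized (backendLock) { // blocks while the caller holds the lock
                              return "REMOVED";
                          }
                      });
                      return fromDriver.get(5, TimeUnit.SECONDS);
                  });
                  removeQueued.countDown();

                  caller.join();
                  outcome[1] = remove.get();
                  return outcome;
              } catch (Exception e) {
                  throw new IllegalStateException(e);
              } finally {
                  yarnSchedulerEndpoint.shutdownNow();
                  driverEndpoint.shutdownNow();
              }
          }

          public static void main(String[] args) {
              String[] outcome = run();
              System.out.println("RequestExecutors: " + outcome[0]);
              System.out.println("RemoveExecutor:   " + outcome[1]);
          }
      }
      ```

      In this model the RequestExecutors ask can only end by timing out, because its reply is queued behind a handler that needs the lock the caller is holding. The RemoveExecutor handler makes progress only once that timeout releases the lock. With equal timeouts on both asks, as assumed for a real deployment, either side may give up first.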


          People

            Assignee: Tao Wang (WangTao)
            Reporter: Tao Wang (WangTao)
            Votes: 0
            Watchers: 6
