Kafka / KAFKA-4360

Controller may deadlock when autoLeaderRebalance encounters a ZooKeeper session expiration


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
    • Fix Version/s: 0.10.2.0
    • Component/s: controller
    • Labels: Important

    Description

      When the controller has a pending checkAndTriggerPartitionRebalance task in autoRebalanceScheduler and the ZooKeeper session
      expires at that moment, the controller deadlocks.

      The sequence is as follows. When the ZooKeeper session expires, the ZooKeeper event thread calls handleNewSession (defined in SessionExpirationListener), which acquires controllerContext.controllerLock and then calls autoRebalanceScheduler.shutdown(). That shutdown waits for every task in autoRebalanceScheduler to complete, but the scheduler's thread pool also needs controllerContext.controllerLock, which is already held by the ZooKeeper callback thread. Each thread waits for the other, so they deadlock.
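      The lock cycle described above can be reproduced in isolation with plain java.util.concurrent primitives. This is a hypothetical model, not Kafka source code: a ReentrantLock stands in for controllerContext.controllerLock, a single-thread ScheduledExecutorService stands in for autoRebalanceScheduler, and rebalanceTask stands in for checkAndTriggerPartitionRebalance. All class and method names are invented for illustration. The report describes shutdown as waiting for all tasks to finish; the sketch uses a 500 ms timeout only so it can detect the hang and exit. The second method shows one possible way to break the cycle (draining the scheduler before taking the lock); that is an assumption, not necessarily the fix committed for 0.10.2.0.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the KAFKA-4360 lock cycle; not Kafka source code.
class ControllerDeadlockSketch {
    // Stand-in for controllerContext.controllerLock.
    static final ReentrantLock controllerLock = new ReentrantLock();

    // Stand-in for checkAndTriggerPartitionRebalance: it needs the controller lock.
    static void rebalanceTask() {
        controllerLock.lock();
        try {
            // leader-rebalance work would happen here
        } finally {
            controllerLock.unlock();
        }
    }

    // awaitTermination wrapped so callers do not handle a checked exception.
    static boolean drain(ScheduledExecutorService scheduler, long millis) {
        try {
            return scheduler.awaitTermination(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Reported pattern: the session-expiration callback holds the lock, then shuts
    // the scheduler down and waits for its pending task, which is blocked on that
    // same lock. Returns true if the wait timed out, i.e. the threads were stuck.
    static boolean shutdownUnderLockTimesOut() {
        ScheduledExecutorService autoRebalanceScheduler =
                Executors.newSingleThreadScheduledExecutor();
        controllerLock.lock();
        try {
            autoRebalanceScheduler.submit(ControllerDeadlockSketch::rebalanceTask);
            // shutdown() still runs already-submitted tasks, so the task blocks on the lock.
            autoRebalanceScheduler.shutdown();
            return !drain(autoRebalanceScheduler, 500);
        } finally {
            controllerLock.unlock();              // lets the blocked task finish
            drain(autoRebalanceScheduler, 5_000); // so the demo JVM can exit cleanly
        }
    }

    // Possible mitigation (assumption): drain the scheduler before taking the lock,
    // so no scheduled task can end up waiting on the thread that holds it.
    static boolean shutdownBeforeLockCompletes() {
        ScheduledExecutorService autoRebalanceScheduler =
                Executors.newSingleThreadScheduledExecutor();
        autoRebalanceScheduler.submit(ControllerDeadlockSketch::rebalanceTask);
        autoRebalanceScheduler.shutdown();
        boolean drained = drain(autoRebalanceScheduler, 5_000);
        controllerLock.lock();
        try {
            // remaining session-expiration handling would happen here
        } finally {
            controllerLock.unlock();
        }
        return drained;
    }

    public static void main(String[] args) {
        System.out.println("shutdown under lock timed out: " + shutdownUnderLockTimesOut());
        System.out.println("shutdown before lock drained: " + shutdownBeforeLockCompletes());
    }
}
```

      Running main prints true on both lines: the first wait times out because each thread holds what the other needs, while the second ordering lets the pending task finish before the lock is taken.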

      This causes at least two problems. First, the broker cannot re-register its id in ZooKeeper, so the new controller considers it
      dead. Second, the process cannot be stopped with kafka-server-stop.sh, because the shutdown path also needs controllerContext.controllerLock; the only way to stop Kafka is kill -9.

      I have attached a jstack dump taken while the Kafka process could not be stopped with kafka-server-stop.sh.

      I have hit this scenario several times, and I believe it is an unresolved bug in Kafka.


          People

            Unassigned
            Json Tu (tuyang)
            Jiangjie Qin
            Votes: 0
            Watchers: 5


              Time Tracking

              Original Estimate: 168h
              Remaining Estimate: 168h
              Time Spent: Not Specified
