Controller can't start after several zk expired events

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
    • Fix Version/s: 0.10.2.0
    • Component/s: controller

    Description

      We found that the controller did not start after several zk session expiration events in our test environment. Analysing the log, I found that the controller handles the ephemeral node's data-deleted event first and the zk session-expired event second, after which no controller is running.
      I can reproduce it in my development environment:
      1. Set up an environment with one broker and one zk node, and specify a very large zk timeout (20s).
      2. Stop the broker and remove zk's /brokers/ids/0 directory.
      3. Restart the broker and set a breakpoint in the zk client's event thread so that the delete event stays queued.
      4. Once the /controller node is gone, the breakpoint is hit.
      5. Expire the current session (suspend the send thread) and create a new session s2.
      6. Resume the event thread; the controller then handles LeaderChangeListener.handleDataDeleted and becomes leader.
      7. The controller then handles SessionExpirationListener.handleNewSession: it resigns as controller and runs an election, but during the election it finds that the /controller node already exists, so it does not become leader. Because the /controller node was created by the current session s2, it is never removed, so the cluster is left without an active controller.
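
The event ordering in steps 4 to 7 can be replayed with a small toy model. This is only a sketch under stated assumptions: `ToyZk`, `ToyController`, and `replay_race` are hypothetical names invented for illustration, and none of this is Kafka's or ZkClient's actual code. The model assumes that an election stops as soon as it sees an existing /controller node, as step 7 describes.

```python
class ToyZk:
    """Toy stand-in for ZooKeeper: maps an ephemeral path to (owner session, data)."""

    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, data, session):
        # Creation fails if the path already exists.
        if path in self.nodes:
            return False
        self.nodes[path] = (session, data)
        return True

    def read(self, path):
        entry = self.nodes.get(path)
        return None if entry is None else entry[1]

    def owner(self, path):
        entry = self.nodes.get(path)
        return None if entry is None else entry[0]

    def expire(self, session):
        # Expiring a session removes every ephemeral node it owns.
        self.nodes = {p: e for p, e in self.nodes.items() if e[0] != session}


class ToyController:
    """Hypothetical, heavily simplified controller; not Kafka's actual code."""

    PATH = "/controller"

    def __init__(self, zk, broker_id):
        self.zk = zk
        self.broker_id = broker_id
        self.session = None
        self.active = False  # True while controller duties are running

    def elect(self):
        # Assumption taken from step 7: if the node exists, stop the election.
        if self.zk.read(self.PATH) is not None:
            return
        if self.zk.create_ephemeral(self.PATH, self.broker_id, self.session):
            self.active = True

    def handle_data_deleted(self):
        # Step 6: the queued delete event triggers an election.
        self.elect()

    def handle_new_session(self):
        # Step 7: resign first, then try to get elected again.
        self.active = False
        self.elect()


def replay_race():
    zk = ToyZk()
    controller = ToyController(zk, broker_id=0)
    controller.session = "s1"
    controller.elect()              # normal startup under session s1

    zk.expire("s1")                 # /controller disappears with session s1
    controller.session = "s2"       # the new session exists before any event runs
    queued = [controller.handle_data_deleted, controller.handle_new_session]
    for event in queued:            # the event thread resumes, in firing order
        event()

    return {
        "node_data": zk.read(ToyController.PATH),
        "node_owner": zk.owner(ToyController.PATH),
        "controller_active": controller.active,
    }


if __name__ == "__main__":
    print(replay_race())
```

In this model the /controller node ends up owned by the live session s2 and still names broker 0, while the controller itself has resigned and is not running. Nothing will delete the node, so no further election is triggered.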

          Activity

            pengwei Pengwei added a comment -

            I already have a patch for this issue. Could the issue be assigned to me?

            guozhang Guozhang Wang added a comment -

            I have assigned this JIRA to you, pengwei. BTW, which version of Kafka were you testing? The affected versions range from 0.9.0.0 to 0.10.0.1, so I am not sure which version was tested.

            Also, there are some known issues with older versions of ZkClient in which events are not processed in exactly the order they fired, which may cause various problems, including lost events. ZkClient has been upgraded from the older version to 0.10.0.x; I'm wondering whether that has solved the problem.

            pengwei Pengwei added a comment -

            We tested it on 0.9.0.0, but I found that the controller code is nearly the same across these versions.
            In 0.9.0.0, the zk version is 3.4.6.

            githubbot ASF GitHub Bot added a comment -

            GitHub user pengwei-li opened a pull request:

            https://github.com/apache/kafka/pull/2175

            KAFKA-4229:Controller can't start after several zk expired event

            Author: pengwei <pengwei.li@huawei.com>

            Reviewers: wangguoz.gmail.com

            You can merge this pull request into a Git repository by running:

            $ git pull https://github.com/pengwei-li/kafka trunk

            Alternatively you can review and apply these changes as the patch at:

            https://github.com/apache/kafka/pull/2175.patch

            To close this pull request, make a commit to your master/trunk branch
            with (at least) the following in the commit message:

            This closes #2175


            commit a920d4e9807add634cc44e4b7cf9e156edd515cf
            Author: pengwei-li <pengwei.li@huawei.com>
            Date: 2016-07-10T00:31:56Z

            KAFKA-1429: Yet another deadlock in controller shutdown

            Author: pengwei <pengwei.li@huawei.com>

            Reviewers: NA

            commit 2a5a4322c8ac359587f05b459588cd2b5843a2ac
            Author: pengwei-li <pengwei.li@huawei.com>
            Date: 2016-11-20T11:31:21Z

            Merge branch 'trunk' of https://github.com/apache/kafka into trunk

            commit b827a8b4f249050ca40db9f14e8e10b01650a6b8
            Author: pengwei-li <pengwei.li@huawei.com>
            Date: 2016-11-20T12:18:49Z

            Merge branch 'trunk' of https://github.com/apache/kafka into trunk

            commit 43e186f223dee1e24177a87ee6888eaae91547d9
            Author: pengwei-li <pengwei.li@huawei.com>
            Date: 2016-11-27T01:54:00Z

            Merge branch 'trunk' of https://github.com/apache/kafka into trunk

            commit febe4f433452a2ad8849a329bc5c9f4d1507a317
            Author: pengwei-li <pengwei.li@huawei.com>
            Date: 2016-11-27T03:31:26Z

            issue:KAFKA-4229
            reason: controoler can't start afeter several zk expired event


            pengwei Pengwei added a comment - The PR is : https://github.com/apache/kafka/pull/2175
            pengwei Pengwei added a comment - PR : https://github.com/apache/kafka/pull/2175
            githubbot ASF GitHub Bot added a comment -

            Github user asfgit closed the pull request at:

            https://github.com/apache/kafka/pull/2175

            junrao Jun Rao added a comment -

            Issue resolved by pull request 2175
            https://github.com/apache/kafka/pull/2175


            People

              Assignee: Pengwei
              Reporter: Pengwei
              Reviewer: Guozhang Wang