Kafka / KAFKA-1451

Broker stuck due to leader election race

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8.1.1
    • Fix Version/s: 0.8.2.0
    • Component/s: core

    Description

      Symptoms

      The broker does not become available because it is stuck in an infinite loop while electing a leader. This can be recognised by the following line being written repeatedly to server.log:

      [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{"version":1,"brokerid":1,"timestamp":"1400060079108"}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
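
A small sketch for spotting this symptom in a log file. It counts how often the back-off message above repeats for the same path and payload. The function names and the threshold of 10 repeats are assumptions chosen for illustration, not part of Kafka.

```python
import re
from collections import Counter

# Matches the back-off message quoted above and captures the znode payload and path.
CONFLICT_RE = re.compile(
    r"I wrote this conflicted ephemeral node \[(?P<data>.*?)\] at (?P<path>\S+) "
    r"a while back in a different session"
)


def count_conflict_retries(lines):
    """Count back-off messages per (path, payload) pair."""
    counts = Counter()
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            counts[(match.group("path"), match.group("data"))] += 1
    return counts


def looks_stuck(lines, threshold=10):
    """Heuristic (assumed threshold): the same payload was retried many times."""
    return any(n >= threshold for n in count_conflict_retries(lines).values())
```

Any iterable of lines can be passed in, for example an open file handle on the broker's server.log.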
      

      Steps to Reproduce

      In a setup with a single Kafka 0.8.1.1 node and a single ZooKeeper 3.4.6 node (the ZooKeeper version included in the Kafka distribution will likely behave the same):

      1. Start both ZooKeeper and Kafka (in any order)
      2. Stop ZooKeeper
      3. Stop Kafka
      4. Start Kafka
      5. Start ZooKeeper

      Likely Cause

      ZookeeperLeaderElector subscribes to data changes on startup and then triggers an election. If the ephemeral /controller node belonging to the broker's previous ZooKeeper session is deleted after the new session has subscribed to changes, the election is invoked twice, once from startup and once from handleDataDeleted:

      • startup: acquire controllerLock
      • startup: subscribe to data changes
      • zookeeper: delete /controller since the session that created it timed out
      • handleDataDeleted: /controller was deleted
      • handleDataDeleted: wait on controllerLock
      • startup: elect – writes /controller
      • startup: release controllerLock
      • handleDataDeleted: acquire controllerLock
      • handleDataDeleted: elect – attempts to write /controller and then enters an infinite loop because of the conflict

      createEphemeralPathExpectConflictHandleZKBug assumes that the existing znode was written from a different session, which is not true in this case: it was written from the same session. That adds to the confusion.
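
The effect of that assumption can be illustrated with a toy model. This is a sketch only: `FakeZk` and `create_with_backoff` are hypothetical stand-ins, not Kafka or ZooKeeper APIs, and the retry loop is capped at `max_attempts` so the example terminates, whereas the loop described in this report never ends.

```python
class FakeZk:
    """Hypothetical in-memory stand-in for ZooKeeper holding one ephemeral znode."""

    def __init__(self):
        self.node = None  # (owner_session, data), or None if the znode is absent

    def create_ephemeral(self, session, data):
        """Return True on success, False if the znode already exists."""
        if self.node is not None:
            return False
        self.node = (session, data)
        return True

    def expire_session(self, session):
        """Delete the znode if the expired session owns it."""
        if self.node is not None and self.node[0] == session:
            self.node = None


def create_with_backoff(zk, session, data, max_attempts=5, on_backoff=lambda: None):
    """Toy version of the back-off-and-retry behaviour named in the log message.

    On a conflict with identical data, the writer assumes an older session of
    its own created the znode and waits for ZooKeeper to delete it.
    """
    for attempt in range(1, max_attempts + 1):
        if zk.create_ephemeral(session, data):
            return ("created", attempt)
        _owner, existing = zk.node
        if existing != data:
            return ("owned by someone else", attempt)
        on_backoff()  # stands in for sleeping while waiting for the deletion
    return ("still retrying", max_attempts)


# Case 1: znode left over from an old session; it expires during the back-off.
zk = FakeZk()
zk.create_ephemeral("old-session", "broker-1")
stale = create_with_backoff(zk, "new-session", "broker-1",
                            on_backoff=lambda: zk.expire_session("old-session"))

# Case 2: znode written by the same live session; nothing will ever delete it.
zk = FakeZk()
zk.create_ephemeral("new-session", "broker-1")
same = create_with_backoff(zk, "new-session", "broker-1")

print(stale)  # ('created', 2)
print(same)   # ('still retrying', 5)
```

In the second case the writer waits for a deletion that cannot happen while its own session stays alive, which matches the repeated log line in the symptoms.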

      Suggested Fix

      In ZookeeperLeaderElector.startup, run elect first and then subscribe to data changes.
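
A minimal single-threaded sketch comparing the two orderings. `ToyElector` is a hypothetical model, not the Kafka class: callbacks that arrive while startup runs are queued to mimic a thread waiting on controllerLock, any existing znode is treated as this broker's own (single-broker setup), and retries are capped so the stuck case terminates.

```python
class ToyElector:
    """Hypothetical model of the startup sequence in a single-broker setup."""

    def __init__(self, stale_session="old", session="new", max_attempts=3):
        self.node = stale_session        # session owning the leftover /controller
        self.stale_session = stale_session
        self.session = session
        self.max_attempts = max_attempts
        self.subscribed = False
        self.waiting = []                # callbacks blocked on the lock
        self.log = []

    def _expire_stale_session(self):
        # ZooKeeper deletes the old znode; a subscriber gets handleDataDeleted.
        if self.node == self.stale_session:
            self.node = None
            if self.subscribed:
                self.waiting.append("handleDataDeleted")

    def _elect(self, caller):
        for _ in range(self.max_attempts):
            if self.node is None:
                self.node = self.session             # write /controller
                self.log.append(f"{caller}: elected")
                return
            self._expire_stale_session()             # back off; timeout may fire
        self.log.append(f"{caller}: stuck")

    def startup(self, subscribe_first):
        if subscribe_first:
            self.subscribed = True
        self._elect("startup")
        self.subscribed = True
        # Lock released: run whatever was waiting on it.
        while self.waiting:
            self._elect(self.waiting.pop(0))
        return self.log


print(ToyElector().startup(subscribe_first=True))
# ['startup: elected', 'handleDataDeleted: stuck']
print(ToyElector().startup(subscribe_first=False))
# ['startup: elected']
```

With the subscription moved after the election, the deletion of the stale znode no longer queues a second election in this model.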

      Attachments

        1. KAFKA-1451.patch
          2 kB
          Manikumar
        2. KAFKA-1451_2014-07-29_10:13:23.patch
          2 kB
          Manikumar
        3. KAFKA-1451_2014-07-28_20:27:32.patch
          2 kB
          Manikumar

            People

              Assignee: Manikumar (omkreddy)
              Reporter: Maciek Makowski (mmakowski)
              Votes: 2
              Watchers: 19
