KAFKA-3976: Kafka Controller gets stuck, potentially due to zookeeper session id reuse


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0.1, 0.10.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      We've been seeing an issue with a Kafka controller getting "stuck" under 0.9.0.1. We carefully monitor ActiveControllerCount on all brokers and page an operator if it stays at 0 for too long. In this particular case (and we've seen similar a few times before), some unusual sequence of events left the cluster unable to re-elect a controller.
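
      For context, that check boils down to reading the KafkaController gauge over JMX on every broker (the same MBean the Jolokia response below reads). A minimal sketch of such a probe; the JMX host/port here is a placeholder, not our actual tooling:

      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class ActiveControllerProbe {
          public static void main(String[] args) throws Exception {
              // Placeholder JMX endpoint; run this against each broker and
              // alert if the values sum to 0 for long enough.
              JMXServiceURL url = new JMXServiceURL(
                      "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                  MBeanServerConnection conn = connector.getMBeanServerConnection();
                  ObjectName gauge = new ObjectName(
                          "kafka.controller:type=KafkaController,name=ActiveControllerCount");
                  int count = ((Number) conn.getAttribute(gauge, "Value")).intValue();
                  System.out.println("ActiveControllerCount = " + count);
              }
          }
      }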

      This could also be a ZooKeeper client bug. We've spent a while looking through the code: a session does get re-established, but the ZooKeeper session state callbacks don't seem to fire on all the listeners, which leaves the controller stuck.
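
      To illustrate the lifecycle that seems to be breaking down: with the raw ZooKeeper client, the state watcher is expected to see an Expired event and rebuild the client from scratch (Kafka routes the equivalent notifications through its ZkClient state listeners). A hedged sketch of that expected flow, not Kafka's actual code; the ensemble address is a placeholder:

      import java.io.IOException;
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      public class SessionStateWatcher implements Watcher {
          private volatile ZooKeeper zk;

          public void start() throws IOException {
              zk = new ZooKeeper("zk-host:2181", 6000, this); // placeholder ensemble
          }

          @Override
          public void process(WatchedEvent event) {
              switch (event.getState()) {
                  case SyncConnected:
                      // Fires on the initial connect and on any reconnect that
                      // beats the session timeout: same session id, ephemeral
                      // znodes intact, no re-election needed.
                      break;
                  case Expired:
                      // The old session and its ephemeral znodes are gone. Every
                      // listener must hear this so the broker re-registers itself
                      // and re-enters controller election; the hang described
                      // above looks like this notification not reaching all
                      // listeners.
                      try {
                          start(); // brand-new handle, brand-new session id
                      } catch (IOException e) {
                          throw new RuntimeException(e);
                      }
                      break;
                  default:
                      break;
              }
          }
      }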

      Note that with the cluster in this bad state, the bad broker still owns znodes at both /controller and /brokers/ids/0:

      get /controller
      {"version":1,"brokerid":0,"timestamp":"1468870986582"}
      cZxid = 0x20000a646
      ctime = Mon Jul 18 19:43:06 UTC 2016
      mZxid = 0x20000a646
      mtime = Mon Jul 18 19:43:06 UTC 2016
      pZxid = 0x20000a646
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x155e49f2b230004
      dataLength = 54
      numChildren = 0
      get /brokers/ids/0
      {"jmx_port":-1,"timestamp":"1468870990738","endpoints":["SSL://REDACTED"],"host":null,"version":2,"port":-1}
      cZxid = 0x20000a64e
      ctime = Mon Jul 18 19:43:10 UTC 2016
      mZxid = 0x20000a64e
      mtime = Mon Jul 18 19:43:10 UTC 2016
      pZxid = 0x20000a64e
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x155e49f2b230004
      dataLength = 144
      numChildren = 0
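
      Both znodes above carry the same ephemeralOwner (0x155e49f2b230004), so as far as ZooKeeper is concerned a single live session still holds both the controller claim and the broker registration. A quick sketch to confirm that programmatically; the connect string is a placeholder:

      import org.apache.zookeeper.ZooKeeper;
      import org.apache.zookeeper.data.Stat;

      public class EphemeralOwnerCheck {
          public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("zk-host:2181", 6000, event -> { });
              Stat controller = zk.exists("/controller", false);
              Stat broker = zk.exists("/brokers/ids/0", false);
              // In the dump above both stats report ephemeralOwner
              // 0x155e49f2b230004, i.e. one still-live session owns both.
              System.out.printf("same owner: %b%n",
                      controller.getEphemeralOwner() == broker.getEphemeralOwner());
              zk.close();
          }
      }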
      

      However, that broker reports an ActiveControllerCount of 0 (this is a response from Jolokia):

      {"request":{"mbean":"kafka.controller:name=ActiveControllerCount,type=KafkaController","type":"read"},"value":{"Value":0},"timestamp":1468936463,"status":200}
      

      If you look at the attached log files, you can see that both brokers briefly believed they were the controller, then both resigned, and the callbacks meant to fire when a new session is established never fired, leaving the controller "stuck".

      It's perhaps worth noting that these brokers have produced no log output at all since that point. They are relatively idle, though, so that is not too surprising.

      Logs are here (this is a two broker cluster): https://gist.github.com/tcrayford/cbf3d5aa1c46194eb7a98786d83b0eab
      Sensitive data have been redacted and replaced with REDACTED_$TYPE.
      The most interesting part is right at the bottom of each log file, where a ZooKeeper session appears to time out and then be re-established. From my understanding of the ZAB protocol, Kafka shouldn't be reusing session ids here, but it seems to be doing so. That's potentially the issue.
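
      For what it's worth: with the stock client, the only ways to come back on the same session id are a reconnect that beats the session timeout, or explicitly handing the old id and password to a new handle. A sketch of that distinction using the plain ZooKeeper API (placeholder connect string; this is not Kafka's code):

      import org.apache.zookeeper.ZooKeeper;

      public class SessionIdReuse {
          public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("zk-host:2181", 6000, e -> { });
              long id = zk.getSessionId();
              byte[] pw = zk.getSessionPasswd();

              // A plain new handle is assigned a fresh session id by the
              // server; this is what should happen after an Expired event.
              ZooKeeper fresh = new ZooKeeper("zk-host:2181", 6000, e -> { });

              // The only way to carry an id across handles is to pass it back
              // in explicitly. If the server has already expired that session,
              // this handle just receives an Expired event instead of working.
              ZooKeeper reused = new ZooKeeper("zk-host:2181", 6000, e -> { }, id, pw);

              System.out.printf("old=0x%x fresh=0x%x reused=0x%x%n",
                      id, fresh.getSessionId(), reused.getSessionId());
          }
      }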

          People

            • Assignee: Unassigned
            • Reporter: Tom Crayford (tcrayford-heroku)
            • Votes: 0
            • Watchers: 4
