  1. Kafka
  2. KAFKA-764

Race Condition in Broker Registration after ZooKeeper disconnect

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When running our ZooKeepers in VMware, occasionally all the keepers pause simultaneously for long enough that the Kafka clients time out, and then the keepers all un-pause at the same time.

      When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper comes back, ZkUtils.createEphemeralPathExpectConflict finds that the existing broker id node carries the broker's own id, so it does not re-register the node and the function call succeeds. ZooKeeper then figures out that the broker had disconnected from the keeper and deletes the ephemeral node, after allowing the consumer to read the data in the /brokers/ids/x node. The broker then goes on to register all the topics, etc. When consumers connect, they see topic nodes associated with the broker but they can't find the broker node to get connection information for the broker. This sends them into a rebalance loop until they reach rebalance.retries.max and fail.

      This might also be a ZooKeeper issue, but the desired behavior in the disconnect case might be to explicitly delete and recreate the broker node if it is found.
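
      The race and the suggested delete-and-recreate behavior can be sketched with a small in-memory simulation. This is only an illustration under stated assumptions, not Kafka's or ZooKeeper's actual code: FakeZk, both register functions, the session ids, and the "host-2:9092" payload are hypothetical stand-ins. It assumes the stale node is still owned by the pre-disconnect session, which a real ZooKeeper client could check through the ephemeral owner recorded in the node's stat.

```python
class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper ephemeral nodes (hypothetical, not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owner_session_id)

    def create_ephemeral(self, path, data, session_id):
        if path in self.nodes:
            raise KeyError(f"node exists: {path}")
        self.nodes[path] = (data, session_id)

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session_id):
        # The server removes every ephemeral node owned by the expired session.
        stale = [p for p, (_, owner) in self.nodes.items() if owner == session_id]
        for path in stale:
            del self.nodes[path]


def register_expect_conflict(zk, path, data, session_id):
    """Behavior described in the report: an existing node with our own data counts as success."""
    try:
        zk.create_ephemeral(path, data, session_id)
    except KeyError:
        stored_data, _ = zk.nodes[path]
        if stored_data != data:
            raise
        # Same data: treat the node as ours and keep it, even if an old session owns it.


def register_delete_and_recreate(zk, path, data, session_id):
    """Suggested behavior: replace a node that is still owned by a different session."""
    existing = zk.nodes.get(path)
    if existing is not None and existing[1] != session_id:
        zk.delete(path)
    if path not in zk.nodes:
        zk.create_ephemeral(path, data, session_id)


def simulate(register):
    """Return True if the broker node survives expiry of the pre-disconnect session."""
    zk = FakeZk()
    path, data = "/brokers/ids/2", "host-2:9092"  # placeholder payload
    register(zk, path, data, session_id=1)  # initial registration
    register(zk, path, data, session_id=2)  # re-registration after the reconnect
    zk.expire_session(1)                    # server finally expires the old session
    return path in zk.nodes
```

      Calling simulate(register_expect_conflict) leaves the broker node missing once the old session expires, which matches the state consumers run into above, while simulate(register_delete_and_recreate) keeps the node registered under the new session.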

      1. BPPF_2900-Broker_Logs.tbz2
        12.51 MB
        Robert P. Thille

        Activity

        rthille Robert P. Thille added a comment -

        I believe the issues started somewhere around the time of these log messages:
        [2017-05-25 07:08:25,528] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
        [2017-05-25 07:09:02,522] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)

        rthille Robert P. Thille added a comment -

        I believe we saw this issue, or something very similar.
        During a load test, we had a 3-node Kafka cluster which got into a confused state:
        Brokers 0 and 1 were happy and were listed in /brokers/ids/X in ZK. Broker 2 was connected to ZK but not listed in /brokers/ids/2, and brokers 0 & 1 had no connections to broker 2.
        Broker 2 was happily accepting new messages produced to it for hours. Eventually, it did rejoin the cluster, but the published messages were lost as the 0 & 1 brokers seemingly outvoted broker 2 about the partitions.


          People

          • Assignee:
            Unassigned
            Reporter:
            bob.cotton@gmail.com Bob Cotton
          • Votes:
            1
            Watchers:
            2

            Dates

            • Created:
              Updated:
