Kafka / KAFKA-2193

Intermittent network + DNS issues can cause brokers to permanently drop out of a cluster

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.8.1.1
    • None
    • None

    Description

      Our Kafka cluster recently experienced intermittent network and DNS resolution issues, during which this call to connect to ZooKeeper failed with an UnknownHostException:

      https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkConnection.java#L67

      We observed this happen during a processStateChanged(KeeperState.Expired) call:

      https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkClient.java#L649

      The session expiry was in turn caused by what we suspect to be intermittent network issues.

      The failed ZK reconnect seemed to put ZkClient into a state from which it would never recover, and the Kafka broker into a state where it needed a restart to rejoin the cluster: ZkConnection._zk == null, zkclient 0.3.x doesn't appear to make any further reconnect attempts after the failure, and no further state transitions seem likely to happen without a connection to ZK.

      The newer zkclient 0.4.0/0.5.0 releases will helpfully fire a notification when this occurs, so the brokers have an opportunity to handle this sort of failure in a more graceful manner (e.g. by trying to reconnect after some backoff period):

      https://github.com/sgroschupf/zkclient/blob/0.4.0/src/main/java/org/I0Itec/zkclient/ZkClient.java#L461
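
      To illustrate the kind of "reconnect after some backoff period" handling suggested above, here is a minimal sketch using only the Java standard library. It is not Kafka or zkclient code: the class ReconnectWithBackoff, the retry method, and its parameters are hypothetical names made up for this example, and the connect callable stands in for whatever call re-creates the ZooKeeper connection.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Kafka or zkclient.
final class ReconnectWithBackoff {

    /**
     * Calls connect until it succeeds or maxAttempts is reached.
     * Only UnknownHostException is retried; any other exception propagates
     * immediately. The delay doubles after each failure, capped at maxDelayMs.
     */
    static <T> T retry(Callable<T> connect, int maxAttempts,
                       long initialDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }
}
```

      A broker-side handler could wrap its reconnect call in something like retry(reconnect, 10, 1000L, 60000L) so that a transient DNS failure no longer leaves it permanently disconnected. The attempt count and delays here are placeholders, not recommended values.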

      Happy to provide more info here if I can.

            People

              Assignee: Unassigned
              Reporter: Tom Lee (thomaslee)
              Votes: 0
              Watchers: 2
