Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.8.1.1
- None
- None
Description
Our Kafka cluster recently experienced intermittent network and DNS resolution issues, during which the call to connect to ZooKeeper failed with an UnknownHostException.
We observed this during a processStateChanged(KeeperState.Expired) call; the session expiry was in turn caused by what we suspect to be intermittent network issues.
The failed ZK reconnect seemed to put ZkClient into a state from which it would never recover, and the Kafka broker into a state where it needed a restart to rejoin the cluster. ZkConnection._zk was null, zkclient 0.3.x doesn't appear to make any further attempts to reconnect after the failure, and no further state transitions seem likely to happen without a connection to ZK.
The newer zkclient 0.4.0/0.5.0 releases fire a notification when this occurs, so the brokers have an opportunity to handle this sort of failure more gracefully (e.g. by trying to reconnect after some backoff period).
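To make the "reconnect after some backoff period" idea concrete, here is a minimal sketch of a retry loop with exponential backoff. It deliberately does not use the zkclient or Kafka APIs: `ReconnectWithBackoff`, `connectWithBackoff` and the simulated connect call are hypothetical names for illustration only, and the delays and attempt limit are arbitrary assumptions rather than recommended values.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of zkclient or Kafka.
public class ReconnectWithBackoff {

    /**
     * Calls connect until it succeeds or maxAttempts is reached.
     * Only UnknownHostException is retried; the delay doubles after each
     * failed attempt, capped at maxDelayMs. Any other exception propagates.
     */
    public static <T> T connectWithBackoff(Callable<T> connect, int maxAttempts,
                                           long initialDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, give up below
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated connect: DNS resolution fails twice, then succeeds.
        int[] calls = {0};
        String session = connectWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("zk1.example.com");
            }
            return "session-" + calls[0];
        }, 5, 10, 100);
        System.out.println(session + " after " + calls[0] + " attempts");
    }
}
```

In a broker, a loop like this would presumably be triggered from the session-establishment failure notification, with the real ZK connect call in place of the simulated one; whether to retry forever or shut down after a bounded number of attempts is a separate design choice.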
Happy to provide more info here if I can.
Issue Links
- duplicates: KAFKA-5473 handle ZK session expiration properly when a new session can't be established (Resolved)