Affects Version/s: 3.4.13
Fix Version/s: None
I saw this issue in one of our ZooKeeper clusters. The leader host crashed due to a hardware issue, and all follower hosts immediately closed their client connections with the leader host, but the cluster did not trigger leader re-election until we manually restarted a few ZooKeeper hosts some time later.
Snapshot size: 200MB
ZooKeeper version: 3.4.13
The logs below are from the leader and one of the follower hosts, from the time the leader host crashed.
1. Leader host zookeeper-4.us-west.com crashed at 16:21:09; there are no application logs from it.
2. Follower host zookeeper-2.us-west.com immediately closed its client connection with zookeeper-4.us-west.com at 16:21:09.
3. None of the follower hosts triggered leader re-election; there are no such entries in their logs.
4. All the ZooKeeper clients started getting session timeouts.
5. After 27 minutes, at 16:48:30, the follower host detected the quorum problem and logged
"Have smaller server identifier, so dropping the connection"
6. It still did not trigger leader re-election until we manually restarted the ZooKeeper process on a few hosts about 5 minutes later.
7. Finally, after the manual restart of the process on a few hosts, the cluster performed leader re-election and formed a quorum.
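For context on the message in step 5: as far as I understand the 3.4.x election connection code (QuorumCnxManager), when two servers connect to each other for leader election, the server that opened the connection drops it if the remote server id is larger than its own and waits for the larger-id server to connect back. The sketch below only illustrates that tie-break. The function names are hypothetical, and the server ids 2 and 4 are an assumption based on the host names, since the report does not state the myid values.

```python
def keeps_outgoing_connection(my_sid: int, remote_sid: int) -> bool:
    """Return True if the initiating server keeps the election connection.

    Illustrative model of the tie-break: the initiator drops the socket
    when the remote server id is larger than its own id.
    """
    return remote_sid <= my_sid


def describe(my_sid: int, remote_sid: int) -> str:
    """Describe the outcome of my_sid opening a connection to remote_sid."""
    if keeps_outgoing_connection(my_sid, remote_sid):
        return f"server {my_sid} keeps its connection to server {remote_sid}"
    return (
        f"server {my_sid} has the smaller id, drops the connection, "
        f"and waits for server {remote_sid} to connect back"
    )


if __name__ == "__main__":
    # Assumed ids: zookeeper-2 -> 2, zookeeper-4 -> 4 (not stated in the report).
    print(describe(2, 4))
    print(describe(4, 2))
```

Under this reading, the message on its own is part of normal election connection setup and not an error. It shows the follower was opening election connections at 16:48:30, but not why no election completed.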
Leader zookeeper: zookeeper-4.us-west.com
Follower zookeeper: zookeeper-2.us-west.com