We were running three ZooKeeper servers and simulated a failure on one of them.
One ZooKeeper node follows the other, but has trouble connecting. The following exception appears to be the cause:
The last exception while connecting was:
The leader started leading a bit later:
But at that time the follower had already terminated and started a new election, so the leader failed:
According to the documentation, initLimit is the timeout ZooKeeper uses to limit the length of time the ZooKeeper servers in the quorum have to connect to a leader.
Since we have initLimit=10 and tickTime=4000 (milliseconds), a ZooKeeper server should have 10 × 4000 ms = 40 seconds to contact the leader.
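To make the arithmetic explicit, here is a minimal sketch of the budget we expected. The class and method names are hypothetical and are not taken from the ZooKeeper code base; the two values are the ones from our configuration.

```java
public class InitLimitBudget {
    /**
     * Time we expected a follower to have for reaching the leader,
     * computed as initLimit ticks of tickTime milliseconds each.
     * (Hypothetical helper, not part of ZooKeeper.)
     */
    public static long connectBudgetMillis(int initLimit, int tickTimeMillis) {
        return (long) initLimit * tickTimeMillis;
    }

    public static void main(String[] args) {
        int initLimit = 10;   // ticks, from our configuration
        int tickTime = 4000;  // milliseconds per tick, from our configuration
        long budget = connectBudgetMillis(initLimit, tickTime);
        System.out.println(budget / 1000 + " seconds"); // prints "40 seconds"
    }
}
```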
However, the source code in src/java/main/org/apache/zookeeper/server/quorum/Follower.java shows:
It appears that we only have 4 seconds to contact the leader. The timeouts are applied to the socket, but they do not take into account that the ZooKeeper leader may not have started its service yet.
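To illustrate why a socket timeout alone may not help here, the following sketch connects to a local port on which nothing is listening. It is not the ZooKeeper code; the class and method names are made up for this example. On the systems I am aware of, the attempt is refused almost immediately rather than waiting for the timeout, although the exact timing may vary by operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RefusedConnectDemo {
    /** Returns a local port that was free a moment ago, so nothing should be listening on it. */
    public static int findUnusedPort() {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Makes one connection attempt with the given timeout.
     * Returns the milliseconds elapsed until the connection was refused,
     * or -1 if the connection unexpectedly succeeded.
     */
    public static long millisUntilRefused(int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket sock = new Socket()) {
            sock.setSoTimeout(timeoutMillis); // affects blocking reads only
            sock.connect(new InetSocketAddress("127.0.0.1", port), timeoutMillis);
            return -1;
        } catch (ConnectException refused) {
            // No listener yet: the failure is reported without waiting for the timeout.
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int port = findUnusedPort();
        long elapsed = millisUntilRefused(port, 4000);
        System.out.println("refused after " + elapsed + " ms (timeout was 4000 ms)");
    }
}
```

If the leader has not opened its port yet, each attempt ends this way, so the time a follower really has depends on how often and how long it retries, not on the timeout values set on the socket.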
Is this the expected behaviour? Or is the expected behaviour that followers should always be able to connect to the leader?