Affects Version/s: 3.5.7
Fix Version/s: None
When we restart a ZooKeeper node, it doesn't successfully rejoin the cluster and start serving clients. The ZooKeeper service starts successfully, but it stays idle and returns the message: "This ZooKeeper instance is not currently serving requests"
The ZooKeeper cluster size is 5. Whenever we need to restart the ZooKeeper nodes, we do it one at a time. There are two ways we restart a node:
- stop the service and start it back up again.
- stop the service, replace the host, and start it back up again.
In both cases we see the same issue.
When we investigated the ZooKeeper logs, we saw the following errors/warnings:
"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted epoch, xy"
But when we check, the current epoch of the leader is always the same as its accepted epoch, which also matches the epoch of the ZooKeeper node we are trying to bring back into the quorum.
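One way to double-check the epochs a restarting node has persisted is to read them straight from disk. The sketch below assumes the `acceptedEpoch` and `currentEpoch` files sit in the `version-2` directory and each hold a single decimal number. The function names are hypothetical and only illustrate the comparison named in the log message.

```python
from pathlib import Path


def read_epoch_files(version_dir):
    """Return (accepted_epoch, current_epoch) read from a version-2 directory.

    Assumption: each file holds one decimal integer as text.
    """
    base = Path(version_dir)
    accepted = int((base / "acceptedEpoch").read_text().strip())
    current = int((base / "currentEpoch").read_text().strip())
    return accepted, current


def would_reject_leader(leader_epoch, accepted_epoch):
    """Mirror the condition in the log message: the leader's epoch
    is less than the epoch this node has already accepted."""
    return leader_epoch < accepted_epoch
```

Calling `read_epoch_files` on the restarting node's `version-2` directory and passing the result to `would_reject_leader`, together with the epoch reported by the leader, should show whether the on-disk state alone explains the exception.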
Also, when we get the zxid of every quorum member, they all have the same first byte; only the last two digits differ, so I guess we can safely assume that they are in sync.
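A zxid is a 64-bit value whose upper 32 bits carry the epoch and whose lower 32 bits carry a per-epoch counter, so comparing the upper half across members shows whether they agree on the epoch. A minimal sketch of that split follows. `split_zxid` is a hypothetical helper, and the sample value is made up for illustration.

```python
def split_zxid(zxid):
    """Split a zxid into (epoch, counter).

    Accepts an int or a hex string such as "0x500000012".
    """
    value = int(zxid, 16) if isinstance(zxid, str) else int(zxid)
    epoch = (value >> 32) & 0xFFFFFFFF    # upper 32 bits
    counter = value & 0xFFFFFFFF          # lower 32 bits
    return epoch, counter


# Made-up sample value, not taken from the cluster in this report.
epoch, counter = split_zxid("0x500000012")
print(f"epoch={epoch} counter={counter}")
```

If every member yields the same epoch and only the counter differs slightly, the members agree on the epoch as far as their zxids go.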
Somehow, the ZooKeeper node that we are restarting sees the epoch as having advanced and shuts down as a follower.
The workaround we have at the moment for this issue is:
stop the ZooKeeper service --> rename the current ZooKeeper data directory (version-2) --> start it back up again.
The node then immediately joins the cluster as a follower, since it no longer has any record of the epoch, and starts serving clients.
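The rename step of this workaround can be sketched as below. It assumes the ZooKeeper service on the node is already stopped and is started again afterwards by whatever service manager is in use. `set_aside_version_dir` and the backup naming scheme are hypothetical.

```python
import time
from pathlib import Path


def set_aside_version_dir(data_dir, suffix=None):
    """Rename <data_dir>/version-2 to version-2.bak-<suffix>; return the new path.

    Run only while the ZooKeeper service on this node is stopped.
    Other files in data_dir (such as myid) are left untouched.
    """
    src = Path(data_dir) / "version-2"
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"version-2.bak-{suffix}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)
    return dst
```

Renaming rather than deleting keeps the old snapshots, transaction logs, and epoch files available for later inspection.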