Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.3.5
-
None
-
Gentoo linux, but every environment is affected.
Description
In zookeeper 3.3.5 in file src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325 you may see Thread.sleep(1000);
This is always happens after leader failure or restart. Zookeeper reelects new leader and all followers try to connect to it. But first attempt always fails because of "Connection refused":
2012-07-23 18:55:48,159 - WARN [QuorumPeer:/0.0.0.0:2181:Learner@229] - Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
I propose to change this line to the next code:
if (tries > 0) { Thread.sleep(self.tickTime); }
This way first reconnect attempt will be done immediately, other will wait for tick time (this is good semantic change, I suppose).
The result of this change - leader reelection time lowered from >1500ms to 300-400ms with 50ms tick time. This is pretty important for our production environment and will not break any existing installations.