Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.2.2
-
None
-
None
-
zookeeper-3.2.2; debian
Description
We were running 3 zookeeper servers, and simulated a failure on one of the servers.
The one zookeeper node follows the other, but has trouble connecting. It looks like the following exception is the cause:
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.ZooKeeperServer] Created server 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Following zookeeper3/192.168.131.11:2888 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException: -- Connection refused 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.PlainSocketImpl.socketConnect(Native Method) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at java.net.Socket.connect(Socket.java:546) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156) 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
The last exception while connecting was:
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Unexpected exception 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: -- Connection refused 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.PlainSocketImpl.socketConnect(Native Method) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at java.net.Socket.connect(Socket.java:546) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549) 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.Follower] Exception when following the leader
The leader started leading a bit later
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 94489312534, 25, 2, LOOKING, LOOKING, 0 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
But at that time the follower had already terminated and started a new election, so the leader failed:
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.ZooKeeperServer] Created server 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot /var/lib/zookeeper/version-2/snapshot.1600007d16 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 1600007d16 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification. 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification. 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification. 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification. 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- [org.apache.zookeeper.server.quorum.Leader] Shutdown called 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371) 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297) 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time the ZooKeeper servers in quorum have to connect to a leader
Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a zookeeper server to contact the leader.
However, in the source code src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
152 for (int tries = 0; tries < 5; tries++) { 153 try { 154 //sock = new Socket(); 155 //sock.setSoTimeout(self.tickTime * self.initLimit); 156 sock.connect(addr, self.tickTime * self.syncLimit); 157 sock.setTcpNoDelay(nodelay); 158 break; 159 } catch (IOException e) { 160 if (tries == 4) { 161 LOG.error("Unexpected exception",e); 162 throw e; 163 } else { 164 LOG.warn("Unexpected exception, tries="+tries,e); 165 sock = new Socket(); 166 sock.setSoTimeout(self.tickTime * self.initLimit); 167 } 168 } 169 Thread.sleep(1000); 170 }
It appears as if we only have 4 seconds to contact the leader. The timeouts are applied to the socket, but do not take into account that the zookeeper leader may not have started its zookeeper service yet.
Is this the expected behaviour? Or is the expected behaviour that followers should always be able to connect to the leader?