Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-1005

Zookeeper servers fail to elect a leader succesfully.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: 3.6.0, 3.5.5
    • Component/s: quorum
    • Labels:
      None
    • Environment:

      zookeeper-3.2.2; debian

      Description

      We were running 3 zookeeper servers, and simulated a failure on one of the servers.

      The one zookeeper node follows the other, but has trouble connecting. It looks like the following exception is the cause:

      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.ZooKeeperServer] Created server
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower] Following zookeeper3/192.168.131.11:2888
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException: --  Connection refused
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.PlainSocketImpl.socketConnect(Native Method)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.Socket.connect(Socket.java:546)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
      

      The last exception while connecting was:

      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower] Unexpected exception
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: --  Connection refused
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.PlainSocketImpl.socketConnect(Native Method)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.Socket.connect(Socket.java:546)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
      2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower] Exception when following the leader
      

      The leader started leading a bit later

      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 94489312534, 25, 2, LOOKING, LOOKING, 0
      2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
      2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
      2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
      2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
      

      But at that time the follower had already terminated and started a new election, so the leader failed:

      2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.ZooKeeperServer] Created server
      2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot /var/lib/zookeeper/version-2/snapshot.1600007d16
      2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 1600007d16
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
      2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
      2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
      2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
      2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new notification.
      2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.Leader] Shutdown called
      2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
      2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
      2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
      

      From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:

      The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time the ZooKeeper servers in quorum have to connect to a leader

      Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a zookeeper server to contact the leader.

      However, in the source code src/java/main/org/apache/zookeeper/server/quorum/Follower.java:

      152                 for (int tries = 0; tries < 5; tries++) {
      153                     try {
      154                         //sock = new Socket();
      155                         //sock.setSoTimeout(self.tickTime * self.initLimit);
      156                         sock.connect(addr, self.tickTime * self.syncLimit);
      157                         sock.setTcpNoDelay(nodelay);
      158                         break;
      159                     } catch (IOException e) {
      160                         if (tries == 4) {
      161                             LOG.error("Unexpected exception",e);
      162                             throw e;
      163                         } else {
      164                             LOG.warn("Unexpected exception, tries="+tries,e);
      165                             sock = new Socket();
      166                             sock.setSoTimeout(self.tickTime * self.initLimit);
      167                         }
      168                     }
      169                     Thread.sleep(1000);
      170                 }
      

      It appears as if we only have 4 seconds to contact the leader. The timeouts are applied to the socket, but do not take into account that the zookeeper leader may not have started its zookeeper service yet.

      Is this the expected behaviour? Or is the expected behaviour that followers should always be able to connect to the leader?

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              ahardy Alexandre Hardy
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: