ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1419

Leader election never settles for a 5-node cluster

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.3, 3.5.0
    • Fix Version/s: 3.4.4, 3.5.0
    • Component/s: leaderElection
    • Labels:
      None
    • Environment:

      64-bit Linux, all nodes running on the same machine (different ports)

      Description

      We have a situation where it seems to my untrained eye that leader election never finishes for a 5-node cluster. In this test, all nodes are ZK 3.4.3 and running on the same server (listening on different ports, of course). The nodes have server IDs of 0, 1, 2, 3, 4. The test brings up the cluster in different configurations, adding in a new node each time. We embed ZK in our application, so when we shut a node down and restart it with a new configuration, it all happens in a single JVM process. Here's our server startup code (for the case where there's more than one node in the cluster):

      if (servers.size() > 1) {
          _log.debug("Starting Zookeeper server in quorum server mode");
      
          _quorum_peer = new QuorumPeer();
          synchronized(_quorum_peer) {
              _quorum_peer.setClientPortAddress(clientAddr);
              _quorum_peer.setTxnFactory(log);
              _quorum_peer.setQuorumPeers(servers);
              _quorum_peer.setElectionType(_election_alg);
              _quorum_peer.setMyid(_server_id);
              _quorum_peer.setTickTime(_tick_time);
              _quorum_peer.setInitLimit(_init_limit);
              _quorum_peer.setSyncLimit(_sync_limit);
              QuorumVerifier quorumVerifier =
                  new QuorumMaj(servers.size());
              _quorum_peer.setQuorumVerifier(quorumVerifier);
              _quorum_peer.setCnxnFactory(_cnxn_factory);
              _quorum_peer.setZKDatabase(new ZKDatabase(log));
              _quorum_peer.start();
          }
      } else {
          _log.debug("Starting Zookeeper server in single server mode");
          _zk_server = new ZooKeeperServer();
          _zk_server.setTxnLogFactory(log);
          _zk_server.setTickTime(_tick_time);
          _cnxn_factory.startup(_zk_server);
      }
      

      And here's our shutdown code:

      if (_quorum_peer != null) {
          synchronized(_quorum_peer) {
              _quorum_peer.shutdown();
              FastLeaderElection fle =
                  (FastLeaderElection) _quorum_peer.getElectionAlg();
              fle.shutdown();
              try {
                  _quorum_peer.getTxnFactory().commit();
              } catch (java.nio.channels.ClosedChannelException e) {
                  // ignore
              }
          }
      } else {
          _cnxn_factory.shutdown();
          _zk_server.getTxnLogFactory().commit();
      }
      

      The test steps through the following scenarios in quick succession:

      Run 1: Start a 1-node cluster, servers=[0]
      Run 2: Start a 2-node cluster, servers=[0,3]
      Run 3: Start a 3-node cluster, servers=[0,1,3]
      Run 4: Start a 4-node cluster, servers=[0,1,2,3]
      Run 5: Start a 5-node cluster, servers=[0,1,2,3,4]

      It appears that run 5 never elects a leader – the nodes just keep spewing messages like this (example from node 0):

      2012-03-14 16:23:12,775 13308 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection already for server 2
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 3 (recipient), 0 (myid), 0x2 (n.peerEpoch)
      2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection already for server 3
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - Sending Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), 4 (recipient), 0 (myid), 0x2 (n.peerEpoch)
      2012-03-14 16:23:12,776 13309 [WorkerSender[myid=0]] DEBUG org.apache.zookeeper.server.quorum.QuorumCnxManager  - There is a connection already for server 4
      2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - Receive new notification message. My id = 0
      2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] INFO org.apache.zookeeper.server.quorum.FastLeaderElection  - Notification: 4 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)
      2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - Receive new notification message. My id = 0
      2012-03-14 16:23:12,776 13309 [WorkerReceiver[myid=0]] INFO org.apache.zookeeper.server.quorum.FastLeaderElection  - Notification: 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x2 (n.peerEPoch), LOOKING (my state)
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - Adding vote: from=1, proposed leader=3, proposed zxid=0x0, proposed election epoch=0x1
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - id: 3, proposed id: 3, zxid: 0x0, proposed zxid: 0x0
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - id: 3, proposed id: 3, zxid: 0x0, proposed zxid: 0x0
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - id: 3, proposed id: 3, zxid: 0x0, proposed zxid: 0x0
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - id: 4, proposed id: 3, zxid: 0x0, proposed zxid: 0x0
      2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection  - id: 4, proposed id: 3, zxid: 0x0, proposed zxid: 0x0
      

      I'm guessing this means that nodes 3 and 4 are fighting over leadership, but I don't know enough about the leader election code to debug this any further. Attaching a tarball with the logs for each run and the data directories for each node (though I don't think any data is being written to ZK during the test).

      1. ZOOKEEPER-1419.patch
        5 kB
        Flavio Junqueira
      2. ZOOKEEPER-1419.patch
        5 kB
        Flavio Junqueira
      3. ZOOKEEPER-1419.patch
        1 kB
        Flavio Junqueira
      4. ZOOKEEPER-1419-fixed2.tgz
        2.09 MB
        Jeremy Stribling

        Activity

        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1501 (See https://builds.apache.org/job/ZooKeeper-trunk/1501/)
        ZOOKEEPER-1419. Leader election never settles for a 5-node cluster (flavio via camille) (Revision 1302736)

        Result = SUCCESS
        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302736
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLEPredicateTest.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1501 (See https://builds.apache.org/job/ZooKeeper-trunk/1501/ ) ZOOKEEPER-1419 . Leader election never settles for a 5-node cluster (flavio via camille) (Revision 1302736) Result = SUCCESS camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1302736 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLEPredicateTest.java
        Hide
        Camille Fournier added a comment -

        I'm gonna check this in to trunk and 3.4 tonight.

        Show
        Camille Fournier added a comment - I'm gonna check this in to trunk and 3.4 tonight.
        Hide
        Jeremy Stribling added a comment -

        The patch seems to fix the problem our tests were hitting. Thanks again!

        Show
        Jeremy Stribling added a comment - The patch seems to fix the problem our tests were hitting. Thanks again!
        Hide
        Flavio Junqueira added a comment -

        yes, only 3.4+

        Show
        Flavio Junqueira added a comment - yes, only 3.4+
        Hide
        Patrick Hunt added a comment -

        Flavio this issue is only in 3.4+ right?

        Show
        Patrick Hunt added a comment - Flavio this issue is only in 3.4+ right?
        Hide
        Flavio Junqueira added a comment -

        Hi Camille, The logic is incorrect in trunk. This case in the test I propose, inspired by Jeremy's run:

        Assert.assertFalse (mock.predicate(4L, 0L, 0L, 3L, 0L, 2L));
        

        fails with trunk code. It shouldn't return true in the case the new vote has an earlier epoch, and it does because we don't have the parentheses placed correctly. This was introduced with the ZAB 1.0 changes, btw.

        About the tests, I felt was simpler and cleaner to have a more focused test instead of a test that tests the whole machinery. Do you think it is not sufficient?

        Show
        Flavio Junqueira added a comment - Hi Camille, The logic is incorrect in trunk. This case in the test I propose, inspired by Jeremy's run: Assert.assertFalse (mock.predicate(4L, 0L, 0L, 3L, 0L, 2L)); fails with trunk code. It shouldn't return true in the case the new vote has an earlier epoch, and it does because we don't have the parentheses placed correctly. This was introduced with the ZAB 1.0 changes, btw. About the tests, I felt was simpler and cleaner to have a more focused test instead of a test that tests the whole machinery. Do you think it is not sufficient?
        Hide
        Camille Fournier added a comment -

        I don't see why this is marked to 3.3.5, the logic there does not seem to be faulty at a glance.

        Do we want to add a test with a 5-member quorum or do we think the unit test on the predicate logic is enough?

        Show
        Camille Fournier added a comment - I don't see why this is marked to 3.3.5, the logic there does not seem to be faulty at a glance. Do we want to add a test with a 5-member quorum or do we think the unit test on the predicate logic is enough?
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12518759/ZOOKEEPER-1419.patch
        against trunk revision 1301840.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12518759/ZOOKEEPER-1419.patch against trunk revision 1301840. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/995//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Adding one important comment.

        Show
        Flavio Junqueira added a comment - Adding one important comment.
        Hide
        Flavio Junqueira added a comment -

        Patch including a test.

        Show
        Flavio Junqueira added a comment - Patch including a test.
        Hide
        Jeremy Stribling added a comment -

        Thanks Flavio, we will be testing this shortly and I'll let you know if it solves our problem.

        Jeremy, super! Would it be possible for you to contribute your tests to us for inclusion in our ZK testing infra? (in addition to the tests flavio is going to write).

        Unfortunately none of our tests are ZK unit tests, they all include our entire application. Specifically, they contain they embedded layer we have on top of ZK to support dynamic membership, and including that would pull in so many dependencies that I don't think it would be useful for ZK's tests. Also, I don't think open-sourcing that code is an option for us at this point. Sorry! Though it's obviously not an ideal solution, I'd be happy to manually run the tests on new release candidates sometimes.

        Show
        Jeremy Stribling added a comment - Thanks Flavio, we will be testing this shortly and I'll let you know if it solves our problem. Jeremy, super! Would it be possible for you to contribute your tests to us for inclusion in our ZK testing infra? (in addition to the tests flavio is going to write). Unfortunately none of our tests are ZK unit tests, they all include our entire application. Specifically, they contain they embedded layer we have on top of ZK to support dynamic membership, and including that would pull in so many dependencies that I don't think it would be useful for ZK's tests. Also, I don't think open-sourcing that code is an option for us at this point. Sorry! Though it's obviously not an ideal solution, I'd be happy to manually run the tests on new release candidates sometimes.
        Hide
        Patrick Hunt added a comment -

        Jeremy, super! Would it be possible for you to contribute your tests to us for inclusion in our ZK testing infra? (in addition to the tests flavio is going to write).

        Flavio it would be great if you could add some targeted tests to expand the coverage in the area in which this problem occurred.

        Show
        Patrick Hunt added a comment - Jeremy, super! Would it be possible for you to contribute your tests to us for inclusion in our ZK testing infra? (in addition to the tests flavio is going to write). Flavio it would be great if you could add some targeted tests to expand the coverage in the area in which this problem occurred.
        Hide
        Flavio Junqueira added a comment -

        Thanks for pointing to this problem, Jeremy. The parentheses are not right in fle totalOrderPredicate, which I fix in the attached patch. This is a preliminary patch, since I haven't had time to write a test yet, but it should fix this problem.

        Show
        Flavio Junqueira added a comment - Thanks for pointing to this problem, Jeremy. The parentheses are not right in fle totalOrderPredicate, which I fix in the attached patch. This is a preliminary patch, since I haven't had time to write a test yet, but it should fix this problem.
        Hide
        Jeremy Stribling added a comment -

        Fixed other node3 logs.

        Show
        Jeremy Stribling added a comment - Fixed other node3 logs.
        Hide
        Jeremy Stribling added a comment -

        Oops, actually all the other node3 files were wrong too. Sorry about that, it was very JV of me. Attaching another fixed tarball.

        Show
        Jeremy Stribling added a comment - Oops, actually all the other node3 files were wrong too. Sorry about that, it was very JV of me. Attaching another fixed tarball.
        Hide
        Jeremy Stribling added a comment -

        Fixed node3-run5.out

        Show
        Jeremy Stribling added a comment - Fixed node3-run5.out
        Hide
        Jeremy Stribling added a comment -

        Hey Jeremy! Looks like a misconfiguration.

        Oops, that was actually my dumbness – node3-run5.out actually contained data from node0's log. I just gathered the logs incorrectly. Attaching a corrected tarball.

        Show
        Jeremy Stribling added a comment - Hey Jeremy! Looks like a misconfiguration. Oops, that was actually my dumbness – node3-run5.out actually contained data from node0's log. I just gathered the logs incorrectly. Attaching a corrected tarball.
        Hide
        Patrick Hunt added a comment -

        Hey Jeremy! Looks like a misconfiguration. According to the log file for node3:

        2012-03-14 16:23:12,776 13309 [QuorumPeer[myid=0]/127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - id: 3, proposed id: 3, zxid: 0x0, proposed zxid: 0x0

        Notice the "myid=0", should be "myid=3". Looks right on the other servers though.

        Show
        Patrick Hunt added a comment - Hey Jeremy! Looks like a misconfiguration. According to the log file for node3: 2012-03-14 16:23:12,776 13309 [QuorumPeer [myid=0] /127.0.0.1:2900] DEBUG org.apache.zookeeper.server.quorum.FastLeaderElection - id: 3, proposed id: 3, zxid: 0x0, proposed zxid: 0x0 Notice the "myid=0", should be "myid=3". Looks right on the other servers though.
        Hide
        Jeremy Stribling added a comment -

        Contains logs for each node in each run, and the data directories for all nodes.

        Show
        Jeremy Stribling added a comment - Contains logs for each node in each run, and the data directories for all nodes.

          People

          • Assignee:
            Flavio Junqueira
            Reporter:
            Jeremy Stribling
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development