ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-917

Leader election selected incorrect leader

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Not a Problem
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: leaderElection, server
    • Labels:
      None
    • Environment:

      Cloudera distribution of zookeeper (patched to never cache DNS entries)
      Debian lenny

      Description

      We had three nodes running zookeeper:

      • 192.168.130.10
      • 192.168.130.11
      • 192.168.130.14

      192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.148.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off).

      DNS entries were updated for the new node to allow all the zookeeper servers to find the new node.

      The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid.

      This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet.

        Activity

        Hide
        Alexandre Hardy added a comment -

        Logs for the remaining nodes attached.

        Show
        Alexandre Hardy added a comment - Logs for the remaining nodes attached.
        Hide
        Alexandre Hardy added a comment -

        Excerpt from logs on 192.168.130.10:

        2010-11-02 09:36:28,060 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742
        2010-11-02 09:36:28,061 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37781]
        2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 0, 4294967742, 2, 0, LOOKING, LOOKING, 0
        2010-11-02 09:36:28,063 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
        2010-11-02 09:36:28,064 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50222]
        2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, FOLLOWING, 1
        2010-11-02 09:36:28,065 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2010-11-02 09:36:28,065 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50223]
        2010-11-02 09:36:28,068 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2010-11-02 09:36:28,068 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.12:59044]
        2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, LEADING, 2
        2010-11-02 09:36:28,073 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
        2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37786]
        2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.QuorumPeer: FOLLOWING
        2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.ZooKeeperServer: Created server 
        2010-11-02 09:36:28,074 INFO org.apache.zookeeper.server.quorum.Follower: Following zookeeper3/192.168.130.13:2888
        

        Excerpt from logs on 192.168.130.11:

        2010-11-02 09:36:14,065 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Defaulting to majority quorums
        2010-11-02 09:36:14,120 INFO org.apache.zookeeper.server.quorum.QuorumPeerMain: Starting quorum peer
        2010-11-02 09:36:14,172 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: My election bind port: 3888
        2010-11-02 09:36:14,182 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LOOKING
        2010-11-02 09:36:14,183 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: -1
        2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 2
        2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
        2010-11-02 09:36:14,193 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
        java.lang.InterruptedException
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1952)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
            at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:345)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:532)
        2010-11-02 09:36:14,194 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
        2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 1
        2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
        2010-11-02 09:36:14,195 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue
        java.lang.InterruptedException
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1952)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
            at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:345)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:532)
        2010-11-02 09:36:14,195 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread
        2010-11-02 09:36:14,202 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken:
        java.nio.channels.AsynchronousCloseException
            at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
            at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:281)
            at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:593)
        2010-11-02 09:36:14,401 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: About to leave instance:2, -1, 2, LEADING
        2010-11-02 09:36:14,402 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LEADING
        

        I'm not sure why 192.168.130.13 ended up as the leader when it did not have the most up to date transaction ID. Also, I don't see the notification messages of the other nodes in the logs of 192.168.130.13.

        Is there any reason why other nodes would accept 192.168.130.13 as the leader?

        Show
        Alexandre Hardy added a comment - Excerpt from logs on 192.168.130.10: 2010-11-02 09:36:28,060 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742 2010-11-02 09:36:28,061 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37781] 2010-11-02 09:36:28,061 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 0, 4294967742, 2, 0, LOOKING, LOOKING, 0 2010-11-02 09:36:28,063 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote 2010-11-02 09:36:28,064 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50222] 2010-11-02 09:36:28,064 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, FOLLOWING, 1 2010-11-02 09:36:28,065 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-11-02 09:36:28,065 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.14:50223] 2010-11-02 09:36:28,068 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-11-02 09:36:28,068 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.12:59044] 2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 0, LOOKING, LEADING, 2 2010-11-02 09:36:28,073 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/192.168.130.10:2181 remote=/192.168.130.10:37786] 2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.quorum.QuorumPeer: FOLLOWING 2010-11-02 09:36:28,073 INFO org.apache.zookeeper.server.ZooKeeperServer: Created server 2010-11-02 09:36:28,074 INFO org.apache.zookeeper.server.quorum.Follower: Following zookeeper3/192.168.130.13:2888 Excerpt from logs on 192.168.130.11: 2010-11-02 09:36:14,065 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Defaulting to majority quorums 2010-11-02 09:36:14,120 INFO org.apache.zookeeper.server.quorum.QuorumPeerMain: Starting quorum peer 2010-11-02 09:36:14,172 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: My election bind port: 3888 2010-11-02 09:36:14,182 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LOOKING 2010-11-02 09:36:14,183 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: -1 2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 2 2010-11-02 09:36:14,191 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote 2010-11-02 09:36:14,193 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1952) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:345) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:532) 2010-11-02 09:36:14,194 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread 2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 1 2010-11-02 09:36:14,194 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote 2010-11-02 09:36:14,195 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1952) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:345) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:532) 2010-11-02 09:36:14,195 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Send worker leaving thread 2010-11-02 09:36:14,202 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Connection broken: java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:281) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:593) 2010-11-02 09:36:14,401 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: About to leave instance:2, -1, 2, LEADING 2010-11-02 09:36:14,402 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LEADING I'm not sure why 192.168.130.13 ended up as the leader when it did not have the most up to date transaction ID. Also, I don't see the notification messages of the other nodes in the logs of 192.168.130.13. Is there any reason why other nodes would accept 192.168.130.13 as the leader?
        Hide
        Alexandre Hardy added a comment -

        Sorry, that should have been 192.168.130.13, not 192.168.130.11.

        Show
        Alexandre Hardy added a comment - Sorry, that should have been 192.168.130.13, not 192.168.130.11.
        Hide
        Alexandre Hardy added a comment -

        We noted that certain ephemeral nodes were no longer behaving as expected (started incrementing from zero again) and are concerned about the potential of data loss since the latest zxid's don't seem to be recognized by the leader.

        Show
        Alexandre Hardy added a comment - We noted that certain ephemeral nodes were no longer behaving as expected (started incrementing from zero again) and are concerned about the potential of data loss since the latest zxid's don't seem to be recognized by the leader.
        Hide
        Flavio Junqueira added a comment -

        Hi Alexandre, Could you please post your configuration parameters?

        I noticed the following in both excerpts:

        INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 1
        INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 2
        

        which implies that both servers, 1 and 2, were starting from scratch and in an ensemble of 3 servers they form a quorum.

        Show
        Flavio Junqueira added a comment - Hi Alexandre, Could you please post your configuration parameters? I noticed the following in both excerpts: INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 1 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 2, LOOKING, LOOKING, 2 which implies that both servers, 1 and 2, were starting from scratch and in an ensemble of 3 servers they form a quorum.
        Hide
        Alexandre Hardy added a comment -

        Hi Flavio,

        The three zookeeper servers are zookeeper1, zookeeper2 and zookeeper3.
        Initially the servers were

        • 192.168.130.10: zookeeper1
        • 192.168.130.11: zookeeper3
        • 192.168.130.14: zookeeper2

        After .11 was removed the servers were:

        • 192.168.130.10: zookeeper1
        • 192.168.130.13: zookeeper3
        • 192.168.130.14: zookeeper2

        All other settings were set by hbase:

        • tickTime=2000
        • initLimit=10
        • syncLimit=5
        • peerport=2888
        • leaderport=3888

        zookeeper1 would have node id 0
        zookeeper2 would have node id 1
        zookeeper3 would have node id 2

        I'm not sure what else I can give you concerning the configuration.

        I note that in 192.168.130.14 (node id 1) we have

        2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742
        2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 1, 4294967742, 2, 1, LOOKING, LOOKING, 1
        2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: Have smaller server identifier, so dropping the connection: (2, 1)
        2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote
        2010-11-02 09:36:27,989 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 1, LOOKING, FOLLOWING, 0
        

        I don't think there is much chance of some kind of networking configuration, but could that explain what we are seeing?

        Show
        Alexandre Hardy added a comment - Hi Flavio, The three zookeeper servers are zookeeper1, zookeeper2 and zookeeper3. Initially the servers were 192.168.130.10: zookeeper1 192.168.130.11: zookeeper3 192.168.130.14: zookeeper2 After .11 was removed the servers were: 192.168.130.10: zookeeper1 192.168.130.13: zookeeper3 192.168.130.14: zookeeper2 All other settings were set by hbase: tickTime=2000 initLimit=10 syncLimit=5 peerport=2888 leaderport=3888 zookeeper1 would have node id 0 zookeeper2 would have node id 1 zookeeper3 would have node id 2 I'm not sure what else I can give you concerning the configuration. I note that in 192.168.130.14 (node id 1) we have 2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election: 4294967742 2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 1, 4294967742, 2, 1, LOOKING, LOOKING, 1 2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.QuorumCnxManager: Have smaller server identifier, so dropping the connection: (2, 1) 2010-11-02 09:36:27,988 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Adding vote 2010-11-02 09:36:27,989 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: Notification: 2, -1, 1, 1, LOOKING, FOLLOWING, 0 I don't think there is much chance of some kind of networking configuration, but could that explain what we are seeing?
        Hide
        Flavio Junqueira added a comment -

        I downloaded your logs, but the out files are empty and I couldn't find the notification messages. By looking at the excerpts you posted, it sounds like node 1 tells 0 that it is following 2 and node says that it is following (this is fine as node 2 might have received some old messages), so node 0 must follow 2. Now the question is why node 1 decided to follow 2, specially because it has a higher zxid and the follower code should have rejected an attempt to follow a leader from an earlier epoch.

        It would be nice to have a look at the output of node 1.

        Show
        Flavio Junqueira added a comment - I downloaded your logs, but the out files are empty and I couldn't find the notification messages. By looking at the excerpts you posted, it sounds like node 1 tells 0 that it is following 2 and node says that it is following (this is fine as node 2 might have received some old messages), so node 0 must follow 2. Now the question is why node 1 decided to follow 2, specially because it has a higher zxid and the follower code should have rejected an attempt to follow a leader from an earlier epoch. It would be nice to have a look at the output of node 1.
        Hide
        Flavio Junqueira added a comment -

        Even though the logs do not make a lot of sense for me at this point, I was thinking that your scenario is not supposed to work given our guarantees. Let's look at an example.

        Suppose we have 3 servers: A, B, and C. Suppose that C is initially the leader and proposes operations that B is able to ack, but A doesn't. Now, suppose that I come and replace C with a fresh server, same id but empty state, and I do it before A and B are able to elect a new leader and recover. In this case, A and C may form a quorum and the state of the ZooKeeper ensemble would be empty. The replacement of server C with a fresh server violates our assumptions.

        It should work, though, if you add a fresh server with a working ensemble. That is, you let A and B elect a new leader, and then you start the new C server. In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742 according to your excerpt.

        Show
        Flavio Junqueira added a comment - Even though the logs do not make a lot of sense for me at this point, I was thinking that your scenario is not supposed to work given our guarantees. Let's look at an example. Suppose we have 3 servers: A, B, and C. Suppose that C is initially the leader and proposes operations that B is able to ack, but A doesn't. Now, suppose that I come and replace C with a fresh server, same id but empty state, and I do it before A and B are able to elect a new leader and recover. In this case, A and C may form a quorum and the state of the ZooKeeper ensemble would be empty. The replacement of server C with a fresh server violates our assumptions. It should work, though, if you add a fresh server with a working ensemble. That is, you let A and B elect a new leader, and then you start the new C server. In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742 according to your excerpt.
        Hide
        Alexandre Hardy added a comment -

        The excerpts are extracted from hbase-0.20/hbase*.log, so the information should be readily available.
        The tar file contents should be as follows:

        drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.10/
        drwxr-xr-x ah/users          0 2010-11-03 13:33 192.168.130.10/hbase-0.20/
        -rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out
        -rw-r--r-- ah/users   62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log
        drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.12/
        drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.12/hbase-0.20/
        drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.13/
        drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.13/hbase-0.20/
        -rw-r--r-- ah/users   65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log
        -rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out
        drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.14/
        drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.14/hbase-0.20/
        -rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out
        -rw-r--r-- ah/users   62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log
        

        The only logs that are missing are those for .11, but that should not influence the analysis of the leader election (I hope).

        We are using monitoring software which determines when a zookeeper instance is no longer reachable, and automatically starts a fresh zookeeper instance as replacement. This software can determine the failure and start a new zookeeper instance fairly rapidly. Would it be better to delay the start of a fresh zookeeper instance to allow the existing instances to elect a new leader? If so, do you have any guidelines regarding this delay? (We are considering this approach, but would like to avoid it if possible).

        In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742 according to your excerpt.

        That is indeed the key question that I am trying to find an answer for!

        Show
        Alexandre Hardy added a comment - The excerpts are extracted from hbase-0.20/hbase*.log , so the information should be readily available. The tar file contents should be as follows: drwxr-xr-x ah/users 0 2010-11-02 14:42 192.168.130.10/ drwxr-xr-x ah/users 0 2010-11-03 13:33 192.168.130.10/hbase-0.20/ -rw-r--r-- ah/users 0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out -rw-r--r-- ah/users 62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log drwxr-xr-x ah/users 0 2010-11-02 14:42 192.168.130.12/ drwxr-xr-x ah/users 0 2010-11-03 13:27 192.168.130.12/hbase-0.20/ drwxr-xr-x ah/users 0 2010-11-02 14:42 192.168.130.13/ drwxr-xr-x ah/users 0 2010-11-03 13:27 192.168.130.13/hbase-0.20/ -rw-r--r-- ah/users 65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log -rw-r--r-- ah/users 0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out drwxr-xr-x ah/users 0 2010-11-02 14:42 192.168.130.14/ drwxr-xr-x ah/users 0 2010-11-03 13:27 192.168.130.14/hbase-0.20/ -rw-r--r-- ah/users 0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out -rw-r--r-- ah/users 62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log The only logs that are missing are those for .11, but that should not influence the analysis of the leader election (I hope). We are using monitoring software which determines when a zookeeper instance is no longer reachable, and automatically starts a fresh zookeeper instance as replacement. This software can determine the failure and start a new zookeeper instance fairly rapidly. Would it be better to delay the start of a fresh zookeeper instance to allow the existing instances to elect a new leader? If so, do you have any guidelines regarding this delay? (We are considering this approach, but would like to avoid it if possible). In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742 according to your excerpt. That is indeed the key question that I am trying to find an answer for!
        Hide
        Flavio Junqueira added a comment -

        The program I was using to open your logs was hiding some of the messages for some reason unknown to me. I now understand why the leader was elected in your case and the behavior is legitimate. Let me try to explain.

        We currently repeat the last notification sent to a given server upon reconnecting to it. This is to avoid problems with messages partially sent, and, assuming no further bugs, the protocol is resilient to messages duplicates. At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading and from a quorum saying that they are following, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in the case it partitions away and comes back later.

        From you logs, what happens is:

        1. Fresh server 2 receives previous notifications from 0 and 1, and decide to lead;
        2. Server 1 receives the last message from server 0 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 1 consequently decides to follow 2;
        3. Server 0 receives the last message from server 1 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 0 consequently decides to follow 2.

        Now the main problem I see is that the followers accept the snapshot from the leader, and they shouldn't given that they have moved to a later epoch. I suspect that we currently allow a server to come back to an epoch it has been in the past to again avoid having a server locked out after being partitioned away and healing, but I need to do some further inspection.

        My overall take is that your case is unfortunately not legitimate, meaning that we don't currently provision for configuration changes. The case you expose in general constitutes a loss of quorum, and that violates one of our core assumptions. In more detail, a quorum supporting a leader must have a non-empty intersection with the quorum of servers that have accepted requests in the previous epoch. Wiping out the state of server 2, by replacing it with a fresh server, leads to the situation in which just one server contains all transactions accepted by a quorum (and possibly committed). If you hadn't replaced server 2 with a fresh server, then either server 2 would have been elected again just the same, and it would be fine because it was previously the leader, or it wouldn't have been elected because the leader was previously another server and the last notifications of 0 and 1 would be supporting a different server.

        On reconfigurations, we have talked about it (http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership), but we haven't made enough progress recently and it is currently not implemented. It would be great to get some help here.

        Let me know if this analysis makes any sense to you, please.

        Show
        Flavio Junqueira added a comment - The program I was using to open your logs was hiding some of the messages for some reason unknown to me. I now understand why the leader was elected in your case and the behavior is legitimate. Let me try to explain. We currently repeat the last notification sent to a given server upon reconnecting to it. This is to avoid problems with messages partially sent, and, assuming no further bugs, the protocol is resilient to messages duplicates. At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading and from a quorum saying that they are following, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in the case it partitions away and comes back later. From you logs, what happens is: Fresh server 2 receives previous notifications from 0 and 1, and decide to lead; Server 1 receives the last message from server 0 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 1 consequently decides to follow 2; Server 0 receives the last message from server 1 saying that it is following 2 (which was the previous leader), and the notification from 2 saying that it is leading. Server 0 consequently decides to follow 2. Now the main problem I see is that the followers accept the snapshot from the leader, and they shouldn't given that they have moved to a later epoch. I suspect that we currently allow a server to come back to an epoch it has been in the past to again avoid having a server locked out after being partitioned away and healing, but I need to do some further inspection. My overall take is that your case is unfortunately not legitimate, meaning that we don't currently provision for configuration changes. The case you expose in general constitutes a loss of quorum, and that violates one of our core assumptions. In more detail, a quorum supporting a leader must have a non-empty intersection with the quorum of servers that have accepted requests in the previous epoch. Wiping out the state of server 2, by replacing it with a fresh server, leads to the situation in which just one server contains all transactions accepted by a quorum (and possibly committed). If you hadn't replaced server 2 with a fresh server, then either server 2 would have been elected again just the same, and it would be fine because it was previously the leader, or it wouldn't have been elected because the leader was previously another server and the last notifications of 0 and 1 would be supporting a different server. On reconfigurations, we have talked about it ( http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership ), but we haven't made enough progress recently and it is currently not implemented. It would be great to get some help here. Let me know if this analysis makes any sense to you, please.
        Hide
        Alexandre Hardy added a comment -

        Hi Flavio,

        At first pass this seems to indicate that we can't replace a failed zookeeper server by a new one, but that statement is probably way too strong. If I understand correctly, what you are saying is that the server can be replaced only after a new leader has been elected? i.e. Any fresh server should only be restarted once the quorum has been reestablished?

        I'm not sure I understand exactly why the election went wrong. Were the old election messages resent when the Fresh server was contactable? I would have thought that election messages should be based on the current state, and never send old state.

        This will take some time to digest and think through properly. In the meantime, can you suggest how we should deal with this situation, can we simply wait for the two remaining nodes to establish a quorum, and then reintroduce the third node? I suppose we could test if a quorum has been established by testing if we can establish a new zookeeper session.

        Thanks for the help

        Show
        Alexandre Hardy added a comment - Hi Flavio, At first pass this seems to indicate that we can't replace a failed zookeeper server by a new one, but that statement is probably way too strong. If I understand correctly, what you are saying is that the server can be replaced only after a new leader has been elected? i.e. Any fresh server should only be restarted once the quorum has been reestablished? I'm not sure I understand exactly why the election went wrong. Were the old election messages resent when the Fresh server was contactable? I would have thought that election messages should be based on the current state, and never send old state. This will take some time to digest and think through properly. In the meantime, can you suggest how we should deal with this situation, can we simply wait for the two remaining nodes to establish a quorum, and then reintroduce the third node? I suppose we could test if a quorum has been established by testing if we can establish a new zookeeper session. Thanks for the help
        Hide
        Flavio Junqueira added a comment -

        Hi Alexandre, It is an key premise of important replication algorithms, like Paxos, that there is a portion of the state that persists across crashes (and recoveries). By replacing server 2 with a fresh server, you simply got rid of the persistent state. In general, making that replacement you've made may lead you to trouble due to the problem I exposed a few postings up. Of course, if you wait for a successful election, the problem is supposed to go away because you have reestablished a quorum and this quorum does not contain the faulty server, but then you have to make sure the election happens before you introduce the fresh server perhaps through jmx or by inspecting the logs. Simply setting a reasonable timeout will work in most cases, but the leader election is not guaranteed to succeed, and there is a chance, likely to be small, that you'll end up with a corrupt state.

        Show
        Flavio Junqueira added a comment - Hi Alexandre, It is an key premise of important replication algorithms, like Paxos, that there is a portion of the state that persists across crashes (and recoveries). By replacing server 2 with a fresh server, you simply got rid of the persistent state. In general, making that replacement you've made may lead you to trouble due to the problem I exposed a few postings up. Of course, if you wait for a successful election, the problem is supposed to go away because you have reestablished a quorum and this quorum does not contain the faulty server, but then you have to make sure the election happens before you introduce the fresh server perhaps through jmx or by inspecting the logs. Simply setting a reasonable timeout will work in most cases, but the leader election is not guaranteed to succeed, and there is a chance, likely to be small, that you'll end up with a corrupt state.
        Hide
        Alexandre Hardy added a comment -

        Thanks Flavio,

        We will take the delay approach for the moment. I think the risks are acceptable for our purposes.

        You are welcome to close the issue if there are no outstanding questions that need to be addressed on your part.

        Thanks again for spending so much time on this issue and explaining what the reasoning and consequences are.

        Show
        Alexandre Hardy added a comment - Thanks Flavio, We will take the delay approach for the moment. I think the risks are acceptable for our purposes. You are welcome to close the issue if there are no outstanding questions that need to be addressed on your part. Thanks again for spending so much time on this issue and explaining what the reasoning and consequences are.
        Hide
        Flavio Junqueira added a comment -

        My pleasure to help. I'm marking it as not a problem for now, but feel free to come back and ask for more clarification if needed.

        Show
        Flavio Junqueira added a comment - My pleasure to help. I'm marking it as not a problem for now, but feel free to come back and ask for more clarification if needed.
        Hide
        Patrick Hunt added a comment -

        Sounds like we need more documentation detailing the election process and what expected behavior is. Flavio perhaps you could
        create a JIRA for that and start collecting this type of information? In particular you could link to jiras of this type, with the intent
        of general documentation, including detail about these specific types of questions.

        Show
        Patrick Hunt added a comment - Sounds like we need more documentation detailing the election process and what expected behavior is. Flavio perhaps you could create a JIRA for that and start collecting this type of information? In particular you could link to jiras of this type, with the intent of general documentation, including detail about these specific types of questions.
        Hide
        Vishal Kher added a comment -

        Hi Flavio,

        Sorry for not making much progress on (http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership). I have spent some time to understand the code. But It is a bit difficult to focus on development without dedicated development time. I am pushing to get dedicated development time at work for this so that I don't have to rely on my spare time.

        Few questions related to your comments:
        1. Can you please elaborate on : "At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading and from a quorum saying that they are following, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in the case it partitions away and comes back later."

        2. Why is it not OK for B to give up leadership when it sees that its <epoch,zxid> is lower than others?

        Thanks.

        Show
        Vishal Kher added a comment - Hi Flavio, Sorry for not making much progress on ( http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership ). I have spent some time to understand the code. But It is a bit difficult to focus on development without dedicated development time. I am pushing to get dedicated development time at work for this so that I don't have to rely on my spare time. Few questions related to your comments: 1. Can you please elaborate on : "At the same time, a server A decides to follow another server B if it receives a message from B saying that B is leading and from a quorum saying that they are following, even if A is in a later election epoch. This mechanism is there to avoid A being locked out of the ensemble in the case it partitions away and comes back later." 2. Why is it not OK for B to give up leadership when it sees that its <epoch,zxid> is lower than others? Thanks.
        Hide
        Flavio Junqueira added a comment -

        Hi Vishal, It is certainly understand not having dedicated development time being an issue. I actually didn't know you're interested in the cluster membership... I'm glad to hear, though.

        On your questions:

        1. Suppose we have an ensemble comprising 3 servers: A, B, and C. Now suppose that C is the leader, and both A and B follow C. If A disconnects from C for whatever reason (e.g., network partition) and it tries to elect a leader, it won't get any other process in the LOOKING state. It will actually receive a notification from C saying that it is leading and one from B saying that it is following C, both with an earlier leader election epoch. To avoid having A locked out (not able to elect C as leader), we implemented this exception: a process accepts going back to an earlier leader election only if it receives a notification from the leader saying that it is leading and from a quorum saying that it is following;
        2. I'm not sure if you referring to specific problem of this jira or if you are asking about my hypothetical example. Assuming it is the former, the follower (Follower:followLeader()) checks if the leader is proposing an earlier epoch, and if not, it accepts the leader snapshot. Because the epoch is the same, all followers will accept the leader snapshot follow it.
        Show
        Flavio Junqueira added a comment - Hi Vishal, It is certainly understand not having dedicated development time being an issue. I actually didn't know you're interested in the cluster membership... I'm glad to hear, though. On your questions: Suppose we have an ensemble comprising 3 servers: A, B, and C. Now suppose that C is the leader, and both A and B follow C. If A disconnects from C for whatever reason (e.g., network partition) and it tries to elect a leader, it won't get any other process in the LOOKING state. It will actually receive a notification from C saying that it is leading and one from B saying that it is following C, both with an earlier leader election epoch. To avoid having A locked out (not able to elect C as leader), we implemented this exception: a process accepts going back to an earlier leader election only if it receives a notification from the leader saying that it is leading and from a quorum saying that it is following; I'm not sure if you referring to specific problem of this jira or if you are asking about my hypothetical example. Assuming it is the former, the follower (Follower:followLeader()) checks if the leader is proposing an earlier epoch, and if not, it accepts the leader snapshot. Because the epoch is the same, all followers will accept the leader snapshot follow it.
        Hide
        Vishal Kher added a comment -

        Hi Flavio,

        Thanks for the clarification.

        Looks like this scenario is not likely to happen if replace is done carefully. For the sake of argument, shouldn't we make the exception that you described only if the joining peer is going to be a follower? For the problem reported in this jira, the server that was replaced (server 2) will not have seen the zxid received in the notifications from the other two nodes. Therefore, is this case, will it make sense for server 2 to start a new round of election?

        -Vishal

        Show
        Vishal Kher added a comment - Hi Flavio, Thanks for the clarification. Looks like this scenario is not likely to happen if replace is done carefully. For the sake of argument, shouldn't we make the exception that you described only if the joining peer is going to be a follower? For the problem reported in this jira, the server that was replaced (server 2) will not have seen the zxid received in the notifications from the other two nodes. Therefore, is this case, will it make sense for server 2 to start a new round of election? -Vishal
        Hide
        Flavio Junqueira added a comment -

        Hi Vishal, There is possibly a misunderstanding here. Server 2 reported in this jira (the leader) does not go back to an earlier epoch, but the other two do, and they are following, so if I understand your argument correctly, the exception is being applied as you suggest.

        Show
        Flavio Junqueira added a comment - Hi Vishal, There is possibly a misunderstanding here. Server 2 reported in this jira (the leader) does not go back to an earlier epoch, but the other two do, and they are following, so if I understand your argument correctly, the exception is being applied as you suggest.
        Hide
        Vishal Kher added a comment -

        Hi Flavio,

        Lets see if I understand this right. Server 2 was replaced and became the leader. Server 2 receives old notifications from others and accepts leadership even if its <epoch, zxid> prior to receiving any notifications was <1,0>.

        Server 2 accepts leadership because 0, 1 vote for 2 and we allow 2 to become leader based on point 1. in your comment on 07/Nov/10.

        My question with regards to point 1.:

        • In your example, it is OK to allow A to join the cluster and become a follower (so that A does not remain locked out). But is it OK for A to accept leadership even if it has not seen the zxid reported by others (regardless of the votes)? Shouldn't it reject leadership?

        Am I still misunderstanding the problem?

        Show
        Vishal Kher added a comment - Hi Flavio, Lets see if I understand this right. Server 2 was replaced and became the leader. Server 2 receives old notifications from others and accepts leadership even if its <epoch, zxid> prior to receiving any notifications was <1,0>. Server 2 accepts leadership because 0, 1 vote for 2 and we allow 2 to become leader based on point 1. in your comment on 07/Nov/10. My question with regards to point 1.: In your example, it is OK to allow A to join the cluster and become a follower (so that A does not remain locked out). But is it OK for A to accept leadership even if it has not seen the zxid reported by others (regardless of the votes)? Shouldn't it reject leadership? Am I still misunderstanding the problem?

          People

          • Assignee:
            Unassigned
            Reporter:
            Alexandre Hardy
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development