ZooKeeper / ZOOKEEPER-822

Leader election taking a long time to complete

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.2, 3.4.0
    • Component/s: quorum
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Created a 3-node cluster.

      1. Fail the ZK leader.
      2. Let leader election finish. Restart the leader and let it rejoin the cluster.
      3. Repeat.

      After a few rounds, leader election takes anywhere from 25 to 60 seconds to finish. Note: we didn't have any ZK clients, and no new znodes were created.

      zoo.cfg is shown below:

      #Mon Jul 19 12:15:10 UTC 2010
      server.1=192.168.4.12\:2888\:3888
      server.0=192.168.4.11\:2888\:3888
      clientPort=2181
      dataDir=/var/zookeeper
      syncLimit=2
      server.2=192.168.4.13\:2888\:3888
      initLimit=5
      tickTime=2000

      I have attached logs from two nodes that took a long time to form the cluster after the leader was failed. The leader was down anyway, so logs from that node shouldn't matter.
      Look for "START HERE"; logs after that point are the ones of interest.

      Attachments

      1. 822.tar.gz
        47 kB
        Vishal Kher
      2. rhel.tar.gz
        91 kB
        Vishal Kher
      3. test_zookeeper_1.log
        212 kB
        Vishal Kher
      4. test_zookeeper_2.log
        215 kB
        Vishal Kher
      5. zk_leader_election.tar.gz
        53 kB
        Vishal Kher
      6. zookeeper-3.4.0.tar.gz
        33 kB
        Vishal Kher
      7. ZOOKEEPER-822.patch
        9 kB
        Flavio Junqueira
      8. ZOOKEEPER-822.patch
        8 kB
        Flavio Junqueira
      9. ZOOKEEPER-822.patch
        7 kB
        Flavio Junqueira
      10. ZOOKEEPER-822.patch
        7 kB
        Flavio Junqueira
      11. ZOOKEEPER-822.patch
        7 kB
        Flavio Junqueira
      12. ZOOKEEPER-822.patch
        5 kB
        Flavio Junqueira
      13. ZOOKEEPER-822.patch_v1
        6 kB
        Vishal Kher
      14. ZOOKEEPER-822-3.3.2.patch
        9 kB
        Flavio Junqueira
      15. ZOOKEEPER-822-3.3.2.patch
        8 kB
        Flavio Junqueira
      16. ZOOKEEPER-822-3.3.2.patch
        7 kB
        Flavio Junqueira
      17. ZOOKEEPER-822-3.3.2.patch
        7 kB
        Flavio Junqueira

        Activity

        Hudson added a comment -

        Integrated in ZooKeeper-trunk #959 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/959/)
        ZOOKEEPER-822. Leader election taking a long time to complete

        Patrick Hunt added a comment -

        Committed to trunk/33 - thanks Vishal and everyone who pushed this through!

        Benjamin Reed added a comment -

        +1 looks good. ready to commit.

        Flavio Junqueira added a comment -

        Thanks for the comments, Ben. I have modified zookeeperAdmin and added the "zookeeper." prefix to the code.

        Regarding your question, initiateConnection is called from two methods: testInitiateConnection (used only in tests) and connectOne. connectOne is synchronized. Do you still see an issue?

        Benjamin Reed added a comment -

        Looks good overall, Flavio. Just a quick question: I notice that operations on senderWorkerMap in initiateConnection are not synchronized. senderWorkerMap is concurrent, but there could be a race between the get, put, and vsw.finish if initiateConnection is called concurrently for the same sid. Right?

        Also, you need to add a blurb to the config doc for the timeout system variable, which should be "zookeeper.cnxtimeout" so that it can be set from the configuration file.

        Flavio Junqueira added a comment -

        Thanks for reviewing it, Vishal. I have fixed the LOG.warn you pointed out and uploaded new patch files.

        Vishal Kher added a comment -

        Hi Flavio,

        +1. Looks good. I remember looking at the socket.connect() method, but I don't remember why I ruled it out in favor of the thread approach.
        Minor point: there is a missing space before "error" in LOG.warn("Connection broken: for id " + sid + "my id = " + self.getId() + "error..).

        Thank you.

        -Vishal
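
        For reference, a minimal sketch of the socket.connect()-with-timeout idea mentioned above; the class name, the hardcoded 5000 ms value, and the use of a SocketChannel's socket adaptor are illustrative assumptions, not code from any of the attached patches:

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.nio.channels.SocketChannel;

            final class TimedConnectSketch {
                // Bounds the TCP handshake: connect(SocketAddress, int) throws
                // java.net.SocketTimeoutException if it does not finish within the
                // given number of milliseconds.
                static SocketChannel connectWithTimeout(InetSocketAddress addr) throws IOException {
                    SocketChannel channel = SocketChannel.open();   // blocking mode by default
                    channel.socket().connect(addr, 5000);
                    return channel;
                }
            }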

        Flavio Junqueira added a comment -

        I have added a system property called "cnxtimeout" to change the timeout value in QuorumCnxManager. Tests pass for me.
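
        A minimal sketch of how such a timeout system property with a default might be read; the property name "zookeeper.cnxTimeout" and the 5000 ms default are assumptions for illustration and may not match the committed patch:

            final class CnxTimeoutConfig {
                private static final int DEFAULT_CNX_TIMEOUT_MS = 5000;

                // Reads the connection timeout from a JVM system property, falling back
                // to the default when the property is absent or malformed.
                static int connectTimeoutMs() {
                    String value = System.getProperty("zookeeper.cnxTimeout");
                    if (value == null) {
                        return DEFAULT_CNX_TIMEOUT_MS;
                    }
                    try {
                        return Integer.parseInt(value);
                    } catch (NumberFormatException e) {
                        return DEFAULT_CNX_TIMEOUT_MS;
                    }
                }
            }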

        Flavio Junqueira added a comment -

        Going to submit patches that introduce a system property.

        Vishal Kher added a comment -

        Hi Flavio,

        Thanks. I will take a look at the patches.

        -Vishal

        Flavio Junqueira added a comment -

        Attaching patch for 3.3.2.

        Flavio Junqueira added a comment -

        I'm adding a test to the patch. It tries to send a message to an address for which a connection request receives no response, so it has to time out. The test then checks that the amount of time elapsed is less than 6s (the timeout value is hardcoded at 5s). Raising the timeout from 5s to, say, 7s makes the test fail.
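
        A rough, self-contained sketch of this style of timing check (not the actual test from the patch; the non-routable address and the 5 s/6 s bounds are assumptions used only for illustration):

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            public class ConnectTimeoutCheck {
                public static void main(String[] args) {
                    long start = System.currentTimeMillis();
                    try (Socket s = new Socket()) {
                        // 10.255.255.1 is typically non-routable, so the connect cannot complete.
                        s.connect(new InetSocketAddress("10.255.255.1", 3888), 5000);
                    } catch (IOException expected) {
                        // Either a SocketTimeoutException after ~5 s or an immediate failure.
                    }
                    long elapsed = System.currentTimeMillis() - start;
                    if (elapsed >= 6000) {
                        throw new AssertionError("connect did not return within 6s: " + elapsed + " ms");
                    }
                    System.out.println("connect attempt returned after " + elapsed + " ms");
                }
            }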

        Flavio Junqueira added a comment -

        I believe the patch I'm attaching achieves the same goal and is even simpler, but I'd like to make sure it suits your needs, Vishal.

        If you agree with the modifications, I can work on a test. I was also thinking that the 2-second timeout you used before is too low, and I've raised it to 5 seconds. But, instead of trying to argue about which value is ideal, I'd rather use a system property with a default value of at least 5 seconds.

        I also commit to redesigning QuorumCnxManager for either 3.4.0 or 4.0.0 to use a selector or some other approach we agree upon. I've been wanting to do it for a while anyway, and I actually thought there was a jira open for it... Maybe not, I can't find it right now.

        Mahadev konar added a comment -

        Vishal,
        I was expecting some commitment from you to make it use a selector.

        Vishal Kher added a comment -

        I agree with Mahadev.

        Mahadev konar added a comment -

        Vishal, Flavio,
        If there is just one thread running at any point in time, then it's OK. Also, I am really worried about the code structure in LeaderElection.java. It's OK to have a temporary fix, but it would be great to see some commitment from someone on doing it right in 3.4.

        Vishal Kher added a comment -

        Hi Flavio,

        As I mentioned earlier, this is a temporary patch until the selector based approach (non-blocking IO) is ready.

        In general, what is the concern with the current fix? There will be only one thread running at a time. The thread just makes sure that we can bound the connection time. This patch is working well for us as a temporary fix. Apart from the overhead of starting a thread I don't see anything wrong with the fix.

        Again, given that this bug is a blocker for us, we certainly cannot wait until the non-blocking implementation is done and released.

        Thanks.
        -Vishal

        Flavio Junqueira added a comment -

        Hi Vishal, I have taken a look at your patch. As I said before, it sounds good to me to make SocketChannel non-blocking, but I don't much like the approach of creating one thread per connection attempt. Instead, I was thinking that we should try to use a selector. What do you think?
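
        A minimal sketch of the selector idea, under assumed names and independent of QuorumCnxManager: a non-blocking connect whose completion is awaited with a bounded Selector.select():

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.nio.channels.SelectionKey;
            import java.nio.channels.Selector;
            import java.nio.channels.SocketChannel;

            final class SelectorConnectSketch {
                // Returns a connected channel, or null if the attempt did not finish in time.
                static SocketChannel connect(InetSocketAddress addr, long timeoutMs) throws IOException {
                    SocketChannel channel = SocketChannel.open();
                    channel.configureBlocking(false);
                    if (channel.connect(addr)) {
                        return channel;                       // connected immediately (e.g. loopback)
                    }
                    try (Selector selector = Selector.open()) {
                        channel.register(selector, SelectionKey.OP_CONNECT);
                        if (selector.select(timeoutMs) > 0 && channel.finishConnect()) {
                            return channel;                   // handshake completed within the bound
                        }
                    } catch (IOException e) {
                        channel.close();
                        throw e;                              // connection refused, etc.
                    }
                    channel.close();                          // timed out; caller can retry later
                    return null;
                }
            }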

        Flavio Junqueira added a comment -

        I'll have a look at it, Vishal. Thanks for posting it.

        Vishal Kher added a comment -

        Hi Flavio,

        > I think we need some time to converge on problems and fixes.

        I don't think it would take a long time to converge. I think the patch that I attached is quite simple. After adding a new property for timeout we should be good to go.

        > My understanding is that we want to have 3.3.2 out soon, and my feeling is that this is not a blocker for 3.3.2 given Vishal's description and our experience with the system so far, but it would be good to hear from Vishal.

        From our earlier email exchanges I have a feeling that in most cases FLE was tested by restarting the ZooKeeper service (and not by rebooting/shutting down the host). I am a bit concerned that enough time may not have been spent in testing/reproducing this problem. In my opinion, this fix should go in 3.3.2. I know for sure that we won't be able to use the next release as is without this fix.

        Thanks.
        -Vishal

        Vishal Kher added a comment -

        Hi,

        I have attached the diff of my changes. It is a pretty simple fix. My intention was to have a fix with minimal code changes.

        connectOne waits 2 seconds for a connection to be established. Right now the value is hard-coded. I was planning to add a new property that would specify the amount of time to wait for a connect request. If you think that the proposed changes are good enough, I will make the change to use the property value and resubmit a second patch.

        Let me know.

        -Vishal

        Flavio Junqueira added a comment -

        I think we need some time to converge on problems and fixes. My understanding is that we want to have 3.3.2 out soon, and my feeling is that this is not a blocker for 3.3.2 given Vishal's description and our experience with the system so far, but it would be good to hear from Vishal.

        Mahadev konar added a comment -

        Marking this for 3.3.2, to see if we want this included in 3.3.2.

        Flavio Junqueira added a comment -

        1. Blocking connects and accepts:
        You are right: when the node is down, TCP timeouts rule.

        a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use a Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SenderWorker and RecvWorker threads since they block on IO to the respective peer.

        As I commented before, it might be ok to make it asynchronous, especially if we have a way of checking that there is an attempt to establish a connection in progress.
        I'm also still intrigued about why this is a problem for you. I haven't seen any of this being a problem before, which of course doesn't mean we shouldn't fix it. It would be nice to understand what's special about your setup or if others have seen similar problems and I missed the reports.

        b) The blocking IO problem is not restricted to connectOne(); it also shows up in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle: initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.

        If I remember correctly, we currently synchronize connectOne and make all connection establishments through connectOne so that we make sure that we do one at a time. My understanding is that this should reduce the number of rounds of attempts to establish connections, perhaps at the cost of a longer delay in some runs.

        2. Buggy senderWorkerMap handling:
        The code that manages senderWorkerMap is very buggy. It is causing multiple election rounds. While debugging I found that sometimes after FLE a node will have its sendWorkerMap empty even if it has SenderWorker and RecvWorker threads for each peer.

        I don't think that having multiple rounds is bad; in fact, I think it is unavoidable using reasonable timeout values. The second part, however, sounds like a problem we should fix.

        a) The receiveConnection() method calls the finish() method, which removes an entry from the map. Additionally, the thread itself calls finish() which could remove the newly added entry from the map. In short, receiveConnection is causing the exact condition that you mentioned above.

        I thought that we were increasing the intervals between notifications, and if so I believe the case you mention above should not happen more than a few times. Now, to fix it, it sounds like we need to check that the finish call is removing the correct object in sendWorkerMap. That is, obj.finish() should remove obj and do nothing if the SendWorker object in sendWorkerMap is a different one. What do you think?
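
        A minimal sketch of that guarded removal (assumed names, not the actual patch): finish() removes the map entry only if it still points at this very worker, so a newer SendWorker registered for the same sid is left alone.

            import java.util.concurrent.ConcurrentHashMap;

            final class SendWorkerSketch {
                private final ConcurrentHashMap<Long, SendWorkerSketch> senderWorkerMap;
                private final long sid;

                SendWorkerSketch(ConcurrentHashMap<Long, SendWorkerSketch> map, long sid) {
                    this.senderWorkerMap = map;
                    this.sid = sid;
                }

                void finish() {
                    // ConcurrentHashMap.remove(key, value) is atomic: the entry is removed
                    // only if sid currently maps to this exact object.
                    senderWorkerMap.remove(sid, this);
                    // ... interrupt the worker thread and close the socket, as before ...
                }
            }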

        b) Apart from the bug in finish(), receiveConnection is making an entry in senderWorkerMap at the wrong place. Here's the buggy code:

            SendWorker vsw = senderWorkerMap.get(sid);
            senderWorkerMap.put(sid, sw);
            if (vsw != null)
                vsw.finish();

        It makes an entry for the new thread and then calls finish, which causes the new thread to be removed from the Map. The old thread will also get terminated since finish() will interrupt the thread.

        See my comment above. Perhaps I should wait to see your proposed modifications, but I wonder if it works to check that we are removing the correct SendWorker object.

        3. Race condition in receiveConnection and initiateConnection:

        In theory, two peers can keep disconnecting each other's connection.

        Example:
        T0: Peer 0 initiates a connection (request 1)
        T1: Peer 1 receives connection from peer 0
        T2: Peer 1 calls receiveConnection()
        T2: Peer 0 closes connection to Peer 1 because its ID is lower.
        T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
        T3: Peer 1 terminates older connection to peer 0
        T4: Peer 1 calls connectOne() which starts new sendWorker threads for peer 0
        T5: Peer 1 kills connection created in T3 because it receives another (request 2) connect request from 0

        The problem here is that while Peer 0 is accepting a connection from Peer 1 it can also be initiating a connection to Peer 1. So if they hit the right frequencies they could sit in a connect/disconnect loop and cause multiple rounds of leader election.

        I think the cause here is again blocking connects()/accepts(). A peer starts to take action (to kill existing threads and start new threads) as soon as a connection is established at the TCP level. That is, it does not give us any control to synchronize connects and accepts. We could use non-blocking connects and accepts. This will allow us to a) tell a thread not to initiate a connection because the listener is about to accept a connection from the remote peer (using the isAcceptable() and isConnectable() methods of SelectionKey) and b) prevent a thread from initiating multiple connect requests to the same peer. It will simplify synchronization.

        Even though it is true that you could have an infinite run where the two processes bump into each other forever, we increase the interval between new batches of notifications, so it should eventually stop, and in my experience it typically doesn't go beyond two rounds. However, I agree that if there were a way of verifying that there is a connection establishment in progress we could stop earlier. It sounds like a good idea to give it a try.

        Vishal Kher added a comment -

        Hi Flavio,

        I was planning to send out a mail explaining the problems in the FLE implementation that I have found so far. For now, I will put the info here. We can create new JIRAs if needed. I am waiting to hear back from our legal department to resolve copyright issues so that I can share my fixes as well.

        1. Blocking connects and accepts:
        You are right: when the node is down, TCP timeouts rule.

        a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use a Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SenderWorker and RecvWorker threads since they block on IO to the respective peer.
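
        A rough sketch of the AsyncConnect approach described in (a) above (illustrative only; the class and method names are assumptions): the blocking connect runs in a helper thread, and the caller waits a bounded time before interrupting it.

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.nio.channels.SocketChannel;
            import java.util.concurrent.atomic.AtomicReference;

            final class AsyncConnectSketch {
                static SocketChannel connectBounded(final InetSocketAddress addr, long timeoutMs)
                        throws InterruptedException {
                    final AtomicReference<SocketChannel> result = new AtomicReference<SocketChannel>();
                    Thread connector = new Thread("AsyncConnect-" + addr) {
                        @Override public void run() {
                            try {
                                SocketChannel ch = SocketChannel.open();
                                ch.connect(addr);                 // blocking connect
                                result.set(ch);
                            } catch (IOException e) {
                                // Connect failed or was interrupted (ClosedByInterruptException).
                            }
                        }
                    };
                    connector.start();
                    connector.join(timeoutMs);                    // upper bound on the wait
                    if (connector.isAlive()) {
                        // Interrupting closes the channel and aborts the pending connect.
                        connector.interrupt();
                    }
                    return result.get();                          // null if it did not connect in time
                }
            }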

        b) The blocking IO problem is not restricted to connectOne(); it also shows up in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also, the code has an inherent cycle: initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.

        2. Buggy senderWorkerMap handling:
        The code that manages senderWorkerMap is very buggy. It is causing multiple election rounds. While debugging I found that sometimes after FLE a node will have its sendWorkerMap empty even if it has SenderWorker and RecvWorker threads for each peer.

        a) The receiveConnection() method calls the finish() method, which removes an entry from the map. Additionally, the thread itself calls finish() which could remove the newly added entry from the map. In short, receiveConnection is causing the exact condition that you mentioned above.

        b) Apart from the bug in finish(), receiveConnection is making an entry in senderWorkerMap at the wrong place. Here's the buggy code:

            SendWorker vsw = senderWorkerMap.get(sid);
            senderWorkerMap.put(sid, sw);
            if (vsw != null)
                vsw.finish();

        It makes an entry for the new thread and then calls finish, which causes the new thread to be removed from the Map. The old thread will also get terminated since finish() will interrupt the thread.

        3. Race condition in receiveConnection and initiateConnection:

        In theory, two peers can keep disconnecting each other's connection.

        Example:
        T0: Peer 0 initiates a connection (request 1)
        T1: Peer 1 receives connection from peer 0
        T2: Peer 1 calls receiveConnection()
        T2: Peer 0 closes connection to Peer 1 because its ID is lower.
        T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
        T3: Peer 1 terminates older connection to peer 0
        T4: Peer 1 calls connectOne() which starts new sendWorker threads for peer 0
        T5: Peer 1 kills connection created in T3 because it receives another (request 2) connect request from 0

        The problem here is that while Peer 0 is accepting a connection from Peer 1 it can also be initiating a connection to Peer 1. So if they hit the right frequencies they could sit in a connect/disconnect loop and cause multiple rounds of leader election.

        I think the cause here is again blocking connects()/accepts(). A peer starts to take action (to kill existing threads and start new threads) as soon as a connection is established at the TCP level. That is, it does not give us any control to synchronize connects and accepts. We could use non-blocking connects and accepts. This will allow us to a) tell a thread not to initiate a connection because the listener is about to accept a connection from the remote peer (using the isAcceptable() and isConnectable() methods of SelectionKey) and b) prevent a thread from initiating multiple connect requests to the same peer. It will simplify synchronization.

        Any thoughts?

        -Vishal

        Flavio Junqueira added a comment -

        Hi Vishal, good catches:

        1- It sounds right that blocking the connection establishment might increase the time to election unnecessarily when the other party is not up. Here is my interpretation. If the machine is up but the zk server is not running, then we simply get a connection failure and move on. The same doesn't happen when the machine is down, since we need to wait for the connection establishment to time out;
        2- It sounds right that a connection can be dropped erroneously due to a race, but I don't see in which case it can cause the election time to increase substantially, unless the race is triggered multiple times in a row. A server will try to connect upon every new notification, and a server only calls SendWorker.finish() in receiveNotification if it has a higher identifier. In this case, it creates a new connection immediately after, so it would need a previous connection being dropped right before to have the case you're describing;
        3- Servers with higher identifiers decline connection requests from servers with lower identifiers; it is part of the protocol. Is this what you're referring to?

        Vishal Kher added a comment -

        The fix for problem 1 and 2 above eliminates the bug. I will have a patch out soon.

        Vishal Kher added a comment -

        Correction:
        2. SendWorker.run() calls finish at the end. This could result in finish() getting called twice (e.g., finish called from receiveConnection), thus, causing senderWorkerMap.remove(sid) called twice and removing an entry that should not be removed.

        Vishal Kher added a comment -

        Hi,

        To rule out any setup-specific issues (VM/ESX/etc.), I tried to reproduce this problem with 3 physical machines (no VMs). I was able to reproduce the same problem after 15 reboots or so; it took a somewhat larger number of reboots on the physical setup. I used the latest code this time. I have attached the logs.

        I am suspecting 3 problems:
        1. blocking connect in WorkerSender.process as described in my earlier comments.
        2. SendWorker.run() calls finish at the end. This could result in finish() getting called twice (e.g., finish called from receiveConnection), thus, causing senderWorkerMap.remove(sid) called twice and removing an entry that should be removed.
        3. Race condition that causes one of the peers to disconnect the other peer's connection (receiveConnection/initiateConnection issue). I am still verifying/investigating this.

        In logs of server id 0 and 1, search for **REBOOTING LEADER** from the bottom. This line marks the start of faulty FLE.

        -Vishal

        Vishal Kher added a comment -

        Hi Flavio,

        Ah! My trunk is quite old, then. But I don't think it is necessary to run with the latest code for debugging this issue.

        I have identified one problem in WorkerSender.process(). This function calls manager.toSend(), which calls connectOne. connectOne does a blocking connect (which takes on the order of minutes to return if a node is down). Thus, WorkerSender.run() will block and not send any successive notifications to other nodes.

        Let me know what you think.

        I tried adding timeouts to connectOne, but I am running into a similar issue somewhere else, so that didn't fix the problem.

        Flavio Junqueira added a comment -

        Vishal, you don't seem to be using trunk code. The current trunk code would report notifications using the following format when the INFO log level is enabled:

        LOG.info("Notification: " + n.leader + " (n.leader), " + n.zxid +
                        " (n.zxid), " + n.epoch + " (n.round), " + n.state +
                        " (n.state), " + n.sid + " (n.sid), " + self.getPeerState() +
                        " (my state)");
        

        And I'm seeing the following in the excerpt above:

        Notification: 0, 34359738368, 4, 0, LOOKING, LOOKING, 0
        

        Also, it would be great if we could use loggraph to visualize what is going on in your situation.

        Vishal Kher added a comment -

        At line 852 in 10.17.119.101-zookeeper.log, WorkSender finally finds something in sendqueue and starts sending the notification to server 1.

        Vishal Kher added a comment -

        I am suspecting that one of the nodes (10.17.119.101) is not sending the notification to the other node. sendNotifications() is called to send a notification to all peers. This function enters the notification in sendqueue. However, either the entry was not put in the queue (sendqueue.offer failed) or the thread that polls sendqueue did not wake up. I am not sure what the cause is yet.

        I had added extra debug messages. Three messages are of main interest:

        1. In sendNotifications(): print "IN FLE sending notification to server id = 1" for each server. Also print "proposedLeader, proposedZxid, logicalclock".
        2. In FastLeaderElection.lookForLeader(): print "Updating proposal" before calling updateProposal if totalOrderPredicate(n.leader, n.zxid, proposedZxid) is true.
        3. In WorkerSender.process(): log LOG.info("WorkSender.process() QUEUEING m.state= " + m.state + " m.leader=" + m.leader + " m.sid=" + m.sid);

        Supporting log entries from 10.17.119.101-zookeeper.log; I have added descriptions inline:
        ------------------
        2010-08-18 14:53:56,451 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@475] - IN FLE sending notification to server id = 1
        2010-08-18 14:53:56,451 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@476] - proposedLeader, proposedZxid, logicalclock 0343597383684
        2010-08-18 14:53:56,452 - INFO [WorkerSender Thread:FastLeaderElection$Messenger$WorkerSender@352] - WorkSender.process() QUEUEING m.state= LOOKING m.leader=0 m.sid=1
        2010-08-18 14:53:56,452 - DEBUG [WorkerSender Thread:QuorumCnxManager@347] - Opening channel to server 1
        2010-08-18 14:53:56,453 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@475] - IN FLE sending notification to server id = 2
        2010-08-18 14:53:56,453 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@476] - proposedLeader, proposedZxid, logicalclock 0343597383684
        2010-08-18 14:53:56,453 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@690] - Notification: 1, 34359738368, 4, 0, LOOKING, LOOKING, 1
        2010-08-18 14:53:56,454 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@496] - id: 1, proposed id: 0, zxid: 34359738368, proposed zxid: 34359738368
        2010-08-18 14:53:56,454 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@710] - Updating proposal
        2010-08-18 14:53:56,454 - INFO [WorkerSender Thread:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (1, 0)
        2010-08-18 14:53:56,455 - INFO [WorkerSender Thread:FastLeaderElection$Messenger$WorkerSender@352] - WorkSender.process() QUEUEING m.state= LOOKING m.leader=0 m.sid=2
        2010-08-18 14:53:56,458 - DEBUG [WorkerSender Thread:QuorumCnxManager@347] - Opening channel to server 2
        2010-08-18 14:53:56,458 - WARN [Thread-19:QuorumCnxManager$RecvWorker@659] - Connection broken:
        java.io.IOException: Channel eof
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:631)
        2010-08-18 14:53:56,459 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@475] - IN FLE sending notification to server id = 0

        2010-08-18 14:53:56,460 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@476] - proposedLeader, proposedZxid, logicalclock 1343597383684

                • The above line shows that this node (server 0) is going to vote for 1; see "proposedLeader, proposedZxid, logicalclock 1 34359738368 4" (I forgot to add spaces in the message). ****

        2010-08-18 14:53:56,460 - DEBUG [Thread-1:QuorumCnxManager$Listener@446] - Connection request /10.17.119.102:41597
        2010-08-18 14:53:56,461 - DEBUG [Thread-1:QuorumCnxManager$Listener@449] - Connection request: 0
        2010-08-18 14:53:56,461 - DEBUG [Thread-1:QuorumCnxManager$SendWorker@505] - Address of remote peer: 1
        2010-08-18 14:53:56,461 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@475] - IN FLE sending notification to server id = 1
        2010-08-18 14:53:56,462 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@476] - proposedLeader, proposedZxid, logicalclock 1343597383684

                • Above, server 0 queued a notification to be sent to server 1. The notification says that it accepts 1 as the leader. But the notification never got sent: process() was not called at all from WorkerSender. It's almost as if the notification was never entered in sendqueue (in sendNotifications). *****

        2010-08-18 14:53:56,462 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@475] - IN FLE sending notification to server id = 2
        2010-08-18 14:53:56,462 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@476] - proposedLeader, proposedZxid, logicalclock 1343597383684
        2010-08-18 14:53:56,463 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@716] - Adding vote: From = 1, Proposed leader = 1, Porposed zxid = 34359738368, Proposed epoch = 4
        2010-08-18 14:53:56,463 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@690] - Notification: 0, 34359738368, 4, 0, LOOKING, LOOKING, 0
        2010-08-18 14:53:56,463 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@496] - id: 0, proposed id: 1, zxid: 34359738368, proposed zxid: 34359738368
        2010-08-18 14:53:56,463 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@716] - Adding vote: From = 0, Proposed leader = 0, Porposed zxid = 34359738368, Proposed epoch = 4

        Flavio Junqueira added a comment -

        Hi Vishal, Thanks for reporting. Here are some quick comments:

        Issue 1: I think that only the javadoc comment is incorrect. We really just want to check that some process has received notifications.

        Issue 2: The connection will eventually time out if it is not established, so setting a different value should not make a difference. The point about the blocking connect is a good one. I think it is worthwhile creating a jira for it.
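
        For illustration, here is a minimal sketch of how such a connect could be bounded with an explicit timeout by going through the channel's socket adapter. This is only a sketch under assumed names (ElectionConnector, openChannel, cnxTimeoutMs), not the patch attached to this issue:

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.nio.channels.SocketChannel;

        class ElectionConnector {
            // cnxTimeoutMs is an assumed name for a configurable connect timeout.
            static SocketChannel openChannel(InetSocketAddress addr, int cnxTimeoutMs)
                    throws IOException {
                SocketChannel channel = SocketChannel.open();
                try {
                    // Socket.connect(SocketAddress, int) throws a SocketTimeoutException
                    // once cnxTimeoutMs elapses, so a dead peer cannot block the caller
                    // indefinitely the way a plain blocking SocketChannel.open(addr) can.
                    channel.socket().connect(addr, cnxTimeoutMs);
                    channel.socket().setTcpNoDelay(true);
                    return channel;
                } catch (IOException e) {
                    channel.close();
                    throw e;
                }
            }
        }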

        Hide
        Vishal Kher added a comment -

        While going through the code yesterday, I found two potential problems that I thought might be worth reporting in the context of this bug.

        1. In FastLeaderElection.java
        /**
         * Check if all queues are empty, indicating that all messages have been delivered.
         */
        boolean haveDelivered() {
            for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
                LOG.debug("Queue size: " + queue.size());
                if (queue.size() == 0)
                    return true;
            }
            return false;
        }

        the haveDelivered() function returns true as soon as it finds one empty queue, without checking whether the rest of the queues are empty (a sketch of the all-queues check follows after item 2 below).

        2. The QuorumCnxManager.connectAll() function connects to one peer at a time, and it uses a blocking connect (SocketChannel.open). I added a timeout to the SocketChannel.open call, and that did not fix the problem.
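
        To make the difference in item 1 concrete, here is a minimal sketch of haveDelivered() with the semantics the javadoc states, i.e. returning true only when every send queue is empty. It assumes the same enclosing class fields (queueSendMap, LOG) as the snippet above; per the reply above, it may be the javadoc rather than the code that actually needs changing:

        /**
         * Sketch only: return true only if every send queue is empty.
         */
        boolean haveDelivered() {
            for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
                LOG.debug("Queue size: " + queue.size());
                if (queue.size() > 0) {
                    // At least one notification is still waiting to be sent.
                    return false;
                }
            }
            return true;
        }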

        Hide
        Vishal Kher added a comment -

        Hi Ivan,

        Per your suggestion, I have reproduced the same problem on Red Hat Enterprise Linux Server release 5.5.
        (3 RHEL VMs on 3 ESX).

        I used the code from trunk (maybe a week old, but that shouldn't matter).
        The leader election with the latest code is taking far more time to finish (2 minutes 43 seconds).

        The reproduction procedure was the same:

        • reboot -n the leader.
        • let the rebooted node come up
        • start zookeeper on the rebooted node
        • repeat until we run into the issue

        Look for the last round of election (marked ****** Rebooting LEADER) in 10.17.119.101-zookeeper.log and 10.17.119.102-zookeeper.log.

        Thanks.
        -Vishal

        Hide
        Vishal Kher added a comment -

        Hi Ivan,

        Can you describe your setup for me?

        My setup info:

        • 3 ESX boxes
        • 1 SLES 11 VMs on each
        • Cluster of 3 nodes

        I hit this problem consistently after rebooting the leader.

        Thanks.
        -Vishal

        Hide
        Ivan Kelly added a comment -

        Actually, ignore that. I read it wrong.

        Hide
        Ivan Kelly added a comment -

        Could this be related? https://issues.apache.org/jira/browse/ZOOKEEPER-785
        Hide
        Ivan Kelly added a comment -

        The logs were of some help, but I don't understand what's happening. It looks like multiple nodes are claiming leadership at the same time, but that can't be right.

        The FLE changes won't fix it, but they do log more information, so they may make it easier to see what is happening.

        Hide
        Vishal Kher added a comment -

        Hi Ivan.

        Were the logs of any help? It might be worth having 3 VMs and rebooting the leader instead of shutting down the interface. We have seen this on all of our dev clusters, although all the dev clusters are based on the same VM images, so it wouldn't be fair to claim that the problem was seen on a different set of machines. I will try with the latest trunk and let you know the result. Which FLE changes do you think would have fixed this problem?

        Thanks.

        -Vishal

        Hide
        Ivan Kelly added a comment -

        I've tried to repro this with 3 zookeepers running on the same machine, and with 3 zookeepers running on virtual machines, and I cannot get it to repro. I was taking out the leader by shutting down the network interface. Have you been able to repro this on a set of machines other than the ones you first observed it on?

        Also, could you try this with the latest trunk? Some improvements were made around the FLE which may shed some more light.

        Hide
        Vishal Kher added a comment -

        I have attached new logs. I don't use NTP, but all the nodes should be at most a few seconds apart.

        I have marked the start and end of the faulty election. Look at zookeeper-192.168.10.3-log and search for "vishal".

        Note: it is very easy to reproduce the bug. Create a 3-node cluster and reboot the leader (or shut down the network interface). You may need to repeat the test several times.

        If you do a clean shutdown of the leader (zkServer.sh stop), then you won't see this bug. I feel that something related to TCP timeouts / session management of the failed node is causing this problem.

        Hide
        Vishal Kher added a comment -

        Attaching new logs.

        Hide
        Ivan Kelly added a comment -

        Currently the timestamp is used because the zxid isn't always available in the message logs. But yes, the zxid would be more desirable. Perhaps I can extract it from the context some other way.

        Hide
        Patrick Hunt added a comment -

        Ivan, do the clocks need to be in sync? Perhaps you should use the xid (cxid) instead?

        Hide
        Ivan Kelly added a comment -

        Hi Vishal,

        The logs in zk_leader_election.tar.gz seem to be from different runs. node0 starts at 2010-07-22 17:33:54,166, node1 at 2010-07-22 22:21:11,979 and node2 at 2010-07-22 22:22:17,249. Are the clocks on the machine in sync?

        -Ivan

        Hide
        Flavio Junqueira added a comment -

        Thanks for the logs, Vishal. The jira discussing loggraph is ZOOKEEPER-773.

        Hide
        Vishal Kher added a comment -

        Hi Flavio,

        I have attached the logs. Hope this helps. Do you have more info on loggraph (doc, etc)?

        Thanks.

        Hide
        Flavio Junqueira added a comment -

        Hi Vishal, do you think you could upload all three log files from a problematic run? We would like to put them through loggraph to visualize what's going on there. It sounds like it is somehow related to the VM reboots; I don't know why yet.

        Hide
        Vishal Kher added a comment -

        Hi Flavio,

        I have ZooKeeper servers running in VMs. To kill a ZK server, I power off its VM. On the other hand, I tried killing the ZK process and restarting it several times and did not see any issues.
        So there is something about the reboot that is causing this problem (the TCP session not getting cleaned up?).

        I don't see many connection exceptions in the log.

        Once the leader election starts we start seeing "Notification time out" messages.

        However, before this we do see that the connection was established (shown below):

        2010-07-19 14:40:52,562 - DEBUG [WorkerSender Thread:QuorumCnxManager@366] - There is a connection already for server 0
        2010-07-19 14:40:52,563 - DEBUG [WorkerSender Thread:QuorumCnxManager@346] - Opening channel to server 2

        Do you still think this is a communication problem?
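
        For context, the "Notification time out" lines generally come from the election loop polling its receive queue with a timeout and backing off when nothing arrives. Below is a simplified, self-contained sketch of that pattern only; the class, method, and field names are made up, and this is not the actual FLE code:

        import java.util.concurrent.LinkedBlockingQueue;
        import java.util.concurrent.TimeUnit;

        class ElectionLoopSketch {
            private final LinkedBlockingQueue<Object> recvqueue = new LinkedBlockingQueue<>();
            private volatile boolean looking = true;

            void lookForLeaderSketch() throws InterruptedException {
                int notTimeout = 200;                      // initial poll timeout (ms), assumed value
                final int maxNotificationInterval = 60000; // backoff cap (ms), assumed value
                while (looking) {
                    Object n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
                    if (n == null) {
                        // Nothing arrived: re-send our vote, retry peer connections,
                        // and double the poll timeout. A peer that never answers
                        // therefore stretches the election instead of failing it fast.
                        resendNotifications();
                        notTimeout = Math.min(notTimeout * 2, maxNotificationInterval);
                        System.out.println("Notification time out: " + notTimeout);
                    } else {
                        handleNotification(n);
                    }
                }
            }

            private void resendNotifications() { /* enqueue our current vote to every peer */ }

            private void handleNotification(Object n) { looking = false; /* tally the vote */ }
        }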

        Hide
        Ivan Kelly added a comment -

        Could you try putting the logs through loggraph (in zookeeper/src/contrib)? Perhaps a graphical view will give some insight?

        Hide
        Flavio Junqueira added a comment -

        Vishal, I can't reproduce your problem. I just ran two trials of killing the leader and rejoining it 20 times each, and I can't see the problem you're mentioning. I wonder if there is anything special about your setup. I can also see lots of connection-related exceptions in your logs, and as a first cut, it sounds like this is preventing the servers from exchanging notifications, hence the delay.

        Two minor comments: your log file for server 2 does not contain "START HERE" and each file duplicates every message.

        Hide
        Vishal Kher added a comment -

        I would like to add that the problem is highly reproducible.

        Hide
        Vishal Kher added a comment -

        Attaching logs from two nodes that took too long to complete leader election.


          People

          • Assignee:
            Vishal Kher
            Reporter:
            Vishal Kher
          • Votes:
            0
            Watchers:
            2
