Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4771

Fast leader election taking too long

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Blocker
    • Resolution: Unresolved
    • 3.4.10
    • None
    • leaderElection
    • None

    Description

      Hello ZooKeeper Community,
       
      Background:
      We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper servers running, that are distributed across 2 clusters of servers.
      In the first cluster, we have 3 Zookeeper servers, each deployed on its own machine, and in the second cluster we have 2 Zookeeper servers, also each on its own machine. Zookeeper servers that are distributed on the same cluster communicate through the local network, and with the servers on the remote cluster through an external network.
      The situation is the following:

       

      Cluster 1
      Zookeeper server 10
      Zookeeper server 11
      Zookeeper server 12 -> Leader
      
      Cluster 2
      Zookeeper server 20
      Zookeeper server 21
      

       

      Problem:
      We have an issue with Fast Leader Election when we kill the ZooKeeper leader process.
      After the leader (server 12) is killed and leader election starts, we can see in Zookeeper logs that voting notifications are exchanged from each Zookeeper server that remained alive towards all the others. Notification on Zookeeper servers located in the same cluster (communicating over the local network) are successfully exchanged. The problem seems to be for Zookeeper server sending votes over external network as according to the logs they are only sent in one direction.
       
      Logs from zookeeper server 10:

      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  LOOKING
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election protocol...
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  New election. My id =  10, proposed zxid=0xe00009c97
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11

      Logs from zookeeper server 20:

      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO  LOOKING
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election protocol...
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe (n.peerEpoch)
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
      Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11

      These are just some of the lines from the log relevant to the issue and complete logs from all the servers are attached as well as the server configuration.

      As it can be seen from the logs the server with ID 20 is sending notifications to all cluster members but in the logs from server 10 there is no indication that a notification from server 20 is received. However, it is obvious from the logs that server 20 received notifications from server 10 at the same time. This part is a bit strange since the same socket is used for communication in both directions.
      The first suspect was the network, but after investigation by following tcpdump we could not find any packages leaving from the cluster where the Zk server 20 is located which is sending notifications.

      About almost half a minute later, the server realizes that there is no connection to Zk servers 10 and 11 and then reopens channels towards them. After that, the notifications are exchanged and the quorum is formed quickly after:

      Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 11, my id = 20, error = #012java.net.SocketException: Connection timed out (Read failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) ~[?:1.8.0_261]#012#011at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
      Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 10, my id = 20, error = #012java.net.SocketException: Connection timed out (Read failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) ~[?:1.8.0_261]#012#011at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
      .....
      Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING
      Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  Created server with tickTime 1500 minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 snapdir /local/version-2
      Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING - LEADER ELECTION TOOK - 27333

      We have two main questions related to this issue if you could help:

      1) Is there any way to know for sure that the vote notification was actually sent other than the following log:

      Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)

      Or is there any chance that the receiving server would discard the notification for some reason and not even log the event, since we cannot see anything in the logs even with debug enabled?

      2) Are there any options to reduce this timeout since half a minute for leader election is too much for our highly available product with only 5 zookeeper servers? Is there any configuration parameter that could be tuned?

      Also, do you know if the same or similar issue has been already reported or fixed in some of the later versions of the product? We could not find something similar in Zookeeper Jira.

       
      Thanks and regards,
      Ivo
       

      Attachments

        1. zookeeper21.log
          4.07 MB
          Ivo Vrdoljak
        2. zookeeper20.log
          3.52 MB
          Ivo Vrdoljak
        3. zookeeper12.log
          4.87 MB
          Ivo Vrdoljak
        4. zookeeper11.log
          4.94 MB
          Ivo Vrdoljak
        5. zookeeper10.log
          4.91 MB
          Ivo Vrdoljak
        6. zookeeper10.conf
          2 kB
          Ivo Vrdoljak

        Activity

          People

            Unassigned Unassigned
            eivovrd Ivo Vrdoljak
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: