ZOOKEEPER-1548: Cluster fails election loop in new and interesting way

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.4.6
    • Component/s: leaderElection
    • Labels: None

      Description

      Hi,

      We have a five-node cluster, recently upgraded from 3.3.5 to 3.4.3. It was running fine for a few weeks after the upgrade, then the following sequence of events occurred:

      1. All servers stopped responding to 'ruok' at the same time.
      2. Our local supervisor process restarted all of them at the same time
         (yes, this is bad; we didn't expect it to fail this way).
      3. The cluster would not serve requests after this. It appeared to be unable to complete an election.

      We tried various things at this point, none of which worked:

      • Changed the restart order of the nodes (e.g. 4 through 0 instead of 0 through 4)
      • Reduced the number of running nodes from 5 to 3 to simplify the quorum, by starting only 0, 1 & 2 in one test, and 0, 2 & 4 in the other
      • Removed the *Epoch files from the version-2/ snapshot directory
      • Put the same version-2/snapshot.xxxxx file on each server in the cluster
      • Added the last txlog (same on all nodes) onto each server
      • Kept only the last snapshot plus txlog, unique on each server
      • Changed leaderServes=no to leaderServes=yes
      • Removed all files and started up with empty data as a control. This worked, but of course isn't terribly useful.

      Finally, I brought the data up on a single node running in standalone mode, and this worked (yay!). So at this point we brought the single node back into service and have kept the other four available to debug why the election is failing.

      We downgraded the four nodes to 3.3.5, and then they completed the election and started serving as expected.
      We did a rolling upgrade to 3.4.3, and everything was fine until we restarted the leader, whereupon we encountered the same re-election loop as before.

      We're a bit out of ideas at this point, so I was hoping someone from this list might have some useful input.

      Output from two followers and a leader during this condition is attached.

      Cheers,

      Al

      Attachments

      1. 1-follower (6 kB, Alan Horn)
      2. 2-follower (6 kB, Alan Horn)
      3. 3-leader (9 kB, Alan Horn)

        Issue Links

        • duplicates ZOOKEEPER-1115

          Activity

          Flavio Junqueira added a comment -

          Closing issues after releasing 3.4.6.

          Flavio Junqueira added a comment -

          I'm resolving this one as a duplicate of ZOOKEEPER-1115.

          Germán Blanco added a comment -

          I think in both ZOOKEEPER-1548 and ZOOKEEPER-1115 the problem is that the follower cannot ack in time because of how long it takes to write the snapshot.

          Ross Cohen added a comment -

          It appears the issue is that the syncLimit is not long enough to cover the initial snapshotting. Increasing the syncLimit allows it to work, but it doesn't feel like a good solution. Would you accept a patch which used initLimit (or a new configuration parameter) when waiting for a snapshot to complete?
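
          (Editorial note: with the configuration posted later in this issue, syncLimit=10 and tickTime=2000 give a follower only 10 x 2000 ms = 20 seconds to acknowledge before the leader stops counting it as synced, whereas initLimit=20 allows 40 seconds for the initial sync. Below is a minimal zoo.cfg sketch of the workaround Ross describes; the value 40 is purely illustrative, not a recommendation.)

          # The number of milliseconds of each tick
          tickTime=2000
          # Workaround sketch: raise syncLimit so that syncLimit * tickTime comfortably
          # exceeds the time a follower needs to write its snapshot
          # (40 ticks = 80 seconds with tickTime=2000; illustrative value only)
          syncLimit=40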

          Alan Horn added a comment -

          Hey Flavio,

          How would I print the content of the synced set? Sorry, I'm a bit new with ZooKeeper.

          Cheers,

          Al
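
          (Editorial note: the thread never answers this. One way, sketched here and not taken from any posted patch, would be a one-line logging change next to the quorum check quoted further down in this thread, in org.apache.zookeeper.server.quorum.Leader#lead():)

          // Hypothetical debugging aid, not part of 3.4.3: print the server ids
          // currently counted as synced, just before the quorum check runs.
          LOG.info("syncedSet before quorum check: " + syncedSet);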

          Flavio Junqueira added a comment -

          I can't see any problem with the configuration file, but thanks for posting.

          I'm not sure why the size of synced is 2 and not 3. Given that the leader has synchronized with the other two followers, I would expect the size of synced to be 3. If you can reproduce it easily, perhaps you could print the content of the synced set. From the logs I couldn't see anything else suspicious.

          I suppose that the pattern you posted keeps repeating itself indefinitely. If there is any difference, it would be good to see it so that we can determine whether it is a race. On my end, I'll see if I observe it as well and will report back.

          Alan Horn added a comment -

          Sure, here you go. I've sanitized the hostnames, but the rest is as it is on the nodes:
          myid files are consistent, and the same config is on all five nodes.

          # zk members
          server.0=zookeeper0:2889:3888
          server.1=zookeeper1:2889:3888
          server.2=zookeeper2:2889:3888
          server.3=zookeeper3:2889:3888
          server.4=zookeeper4:2889:3888
          # The number of milliseconds of each tick
          tickTime=2000
          # The number of ticks that the initial
          # synchronization phase can take
          initLimit=20
          # The number of ticks that can pass between
          # sending a request and getting an acknowledgement
          syncLimit=10
          # the directory where the snapshot is stored.
          dataDir=/data/zookeeper
          # the directory where txlogs are written
          dataLogDir=/data/zookeeper/txlog
          # the port at which the clients will connect
          clientPort=2181
          # limit on queued clients - default: 1000
          globalOutstandingLimit=1000
          
          # number of transactions before snapshots are taken - default: 100000
          snapCount=100000
          # max # of clients - 0==unlimited
          maxClientCnxns=150
          # Election implementation to use. A value of "0" corresponds to the original
          # UDP-based version, "1" corresponds to the non-authenticated UDP-based
          # version of fast leader election, "2" corresponds to the authenticated
          # UDP-based version of fast leader election, and "3" corresponds to TCP-based
          # version of fast leader election. Currently, only 0 and 3 are supported,
          # 3 being the default
          electionAlg=3
          # Leader accepts client connections. Default value is "yes". The leader
          # machine coordinates updates. For higher update throughput at the slight
          # expense of read throughput the leader can be configured to not accept
          # clients and focus on coordination.
          leaderServes=yes
          # Skips ACL checks. This results in a boost in throughput, but opens up full
          # access to the data tree to everyone.
          skipACL=yes
          
          Flavio Junqueira added a comment -

          Leader election seems to complete but the leader abandons leadership. Here is something suspicious:

          java.lang.Exception: shutdown Leader! reason: Only 2 followers, need 2
                  at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:496)
                  at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:471)
                  at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:753)
          

          To get to this message, we need this predicate to hold true:

          (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet))
          

          and I don't understand yet why containsQuorum is false. Could you share your configuration file as well, please?
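
          (Editorial note: for context, the predicate above sits in the leader's ping loop. What follows is a paraphrase of the 3.4.x Leader.lead() check, not the exact source: on every tick the leader rebuilds syncedSet from itself plus the followers that have acked within syncLimit, and steps down if that set is no longer a quorum.)

          // Paraphrase of the quorum check in Leader.lead() (3.4.x), not the exact source.
          HashSet<Long> syncedSet = new HashSet<Long>();
          syncedSet.add(self.getId());                  // the leader counts itself
          for (LearnerHandler f : getLearners()) {
              if (f.synced()) {                         // acked within syncLimit?
                  syncedSet.add(f.getSid());
              }
              f.ping();
          }
          if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
              // produces the "shutdown Leader! ... Only N followers" message above
              shutdown("Only " + syncedSet.size() + " followers, need "
                      + (self.getVotingView().size() / 2));
              return;
          }

          (Because syncedSet counts the leader itself, "Only 2 followers, need 2" really means the leader plus one synced follower. A follower that is still writing its snapshot past syncLimit drops out of the set, which is consistent with the snapshot-timing explanation Germán and Ross give above.)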


            People

            • Assignee: Unassigned
            • Reporter: Alan Horn
            • Votes: 0
            • Watchers: 6
