ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1548

Cluster fails election loop in new and interesting way

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.4.6
    • Component/s: leaderElection
    • Labels:
      None

      Description

      Hi,

      We have a five node cluster, recently upgraded from 3.3.5 to 3.4.3. Was running fine for a few weeks after the upgrade, then the following sequence of events occurred :

      1. All servers stopped responding to 'ruok' at the same time
      2. Our local supervisor process restarted all of them at the same time
      (yes, this is bad, we didn't expect it to fail this way
      3. The cluster would not serve requests after this. Appeared to be unable to complete an election.

      We tried various things at this point, none of which worked :

      • Moved around the restart order of the nodes (e.g. 4 thru 0, instead of 0 thru 4)
      • Reduced number of running nodes from 5 -> 3 to simplify the quorum, by only starting up 0, 1 & 2, in one test, and 0, 2 & 4 in the other
      • Removed the *Epoch files from version-2/ snapshot directory
      • Put the same version2/snapshot.xxxxx file on each server in the cluster
      • Added the (same on all nodes) last txlog onto each cluster
      • Kept only the last snapshot plus txlog unique on each server
      • Moved leaderServes=no to leaderServes=yes
      • Removed all files and started up with empty data as a control. This worked, but of course isn't terribly useful

      Finally, I brought the data up on a single node running in standalone and this worked (yay!) So at this point we brought the single node back into service and have kept the other four available to debug why the election is failing.

      We downgraded the four nodes to 3.3.5, and then they completed the election and started serving as expected.
      We did a rolling upgrade to 3.4.3, and everything was fine until we restarted the leader, whereupon we encountered the same re-election loop as before.

      We're a bit out of ideas at this point, so I was hoping someone from this list might have some useful input.

      Output from two followers and a leader during this condition are attached.

      Cheers,

      Al

      1. 1-follower
        6 kB
        Alan Horn
      2. 2-follower
        6 kB
        Alan Horn
      3. 3-leader
        9 kB
        Alan Horn

        Issue Links

          Activity

          Alan Horn created issue -
          Alan Horn made changes -
          Field Original Value New Value
          Attachment 1-follower [ 12544301 ]
          Attachment 2-follower [ 12544302 ]
          Attachment 3-leader [ 12544303 ]
          Mahadev konar made changes -
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.4.5 [ 12321883 ]
          Mahadev konar made changes -
          Fix Version/s 3.4.6 [ 12323310 ]
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.4.5 [ 12321883 ]
          Germán Blanco made changes -
          Link This issue is related too ZOOKEEPER-1115 [ ZOOKEEPER-1115 ]
          Gavin made changes -
          Link This issue is related to ZOOKEEPER-1115 [ ZOOKEEPER-1115 ]
          Gavin made changes -
          Link This issue is related to ZOOKEEPER-1115 [ ZOOKEEPER-1115 ]
          Flavio Junqueira made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Flavio Junqueira made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Alan Horn
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development