ZooKeeper / ZOOKEEPER-2167

Restarting current leader node sometimes results in a permanent loss of quorum

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.6
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      I'm seeing an issue where restarting the current leader node results in a long-term, possibly permanent, loss of quorum (I've only waited 30 minutes, but it doesn't appear to be making any progress). Restarting the same instance a second time seems to resolve the problem.

      To me, this looks a lot like the issue described in https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this separately for the moment in case I am wrong.

      Notes on the attached log:
      1) If you search for XXX in the log, you'll find my annotations marking when the process was told to terminate, when termination was reported complete, and the same two points for the subsequent start.
      2) To save you the trouble of figuring it out, here's the zid <=> IP mapping:
      zid=1, ip=10.20.0.19
      zid=2, ip=10.20.0.18
      zid=3, ip=10.20.0.20
      zid=4, ip=10.20.0.21
      zid=5, ip=10.20.0.22
      3) It's important to note that this log was captured during a rolling service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is the one being removed, so if you see a conspicuous silence from that server, that's why.
      4) I've been unable to reproduce this problem except during cluster size changes, so I suspect the two may be related. It's also important to note that this test goes from 5 -> 4 servers; since we remove one and then do a rolling restart, we temporarily drop to 3. I know this is not recommended (this is more of a stress test). We have seen the same problem on larger clusters; it just seems easier to reproduce on smaller ones.
      5) The log starts roughly at the point 10.20.0.21 / zid=4 wins the election during the final quorum; zid=4 is the one whose shutdown triggers the problem.
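
      For context on note 4, the majority arithmetic can be sketched as below. This is only an illustration: it assumes ZooKeeper's default majority quorum (strictly more than half of the configured voting members), the helper names are hypothetical, and the report does not state which ensemble size each server had configured at each step of the rolling restart. It shows why quorum is expected to drop while one of 3 live servers restarts; it does not explain why quorum fails to recover afterwards.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is a strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def has_quorum(configured: int, live: int) -> bool:
    """True if `live` servers are enough for a majority of `configured` members."""
    return live >= quorum_size(configured)


def restart_table(live_before_restart: int = 3) -> dict:
    """Quorum status while one of the live servers is down for a restart.

    Keys are hypothetical configured ensemble sizes (5, 4, 3), since the
    report does not say which configuration each server was running.
    """
    live_during_restart = live_before_restart - 1
    return {
        configured: has_quorum(configured, live_during_restart)
        for configured in (5, 4, 3)
    }


if __name__ == "__main__":
    for configured, ok in restart_table().items():
        print(f"configured={configured} quorum_size={quorum_size(configured)} "
              f"quorum_with_2_live={ok}")
```

      Under these assumptions, a server that still counts 4 or 5 voting members needs 3 live peers, so quorum can only return once the restarted server rejoins.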

      Attachments

        1. fails-to-rejoin-quorum.gz
          598 kB
          Mike Lundy
        2. fails-to-rejoin-quorum-longer.gz
          934 kB
          Mike Lundy
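
      A minimal sketch for reading the attachments, assuming they are gzip-compressed plain-text logs (the report does not describe the line layout). It extracts the lines annotated with XXX (note 1) and tags each line with the zids of any IPs it mentions, using the mapping from note 2. All function names here are hypothetical.

```python
import gzip
import re
from typing import Iterable, Iterator, List

# zid <=> IP mapping copied from note 2 of the description.
ZID_BY_IP = {
    "10.20.0.19": 1,
    "10.20.0.18": 2,
    "10.20.0.20": 3,
    "10.20.0.21": 4,
    "10.20.0.22": 5,
}

# Match whole dotted-quad addresses so 10.20.0.2 is not confused with 10.20.0.20.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def marker_lines(lines: Iterable[str], marker: str = "XXX") -> Iterator[str]:
    """Yield only the lines that contain the annotation marker."""
    for line in lines:
        if marker in line:
            yield line


def zids_mentioned(line: str) -> List[int]:
    """Return the sorted zids of all known server IPs appearing in the line."""
    found = {ZID_BY_IP[ip] for ip in IP_PATTERN.findall(line) if ip in ZID_BY_IP}
    return sorted(found)


def show_markers(path: str) -> None:
    """Print each annotated line of a gzip log, prefixed with the zids it mentions."""
    with gzip.open(path, "rt", errors="replace") as log:
        for line in marker_lines(log):
            print(zids_mentioned(line), line.rstrip())
```

      For example, calling show_markers("fails-to-rejoin-quorum.gz") on a locally downloaded copy would list the annotated shutdown and start points.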

          People

            Assignee: Unassigned
            Reporter: Mike Lundy (novas0x2a)
            Votes: 0
            Watchers: 2