ZooKeeper / ZOOKEEPER-1075

Zookeeper Server cannot join an existing ensemble if the existing ensemble doesn't already have a quorum

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.3.2
    • Fix Version/s: None
    • Component/s: leaderElection
    • Labels:
      None
    • Environment:

      Windows 7

      Description

      Here is the sequence of steps that reproduces the problem.
      On a 3 server ensemble,
      1. Bring up two servers (say 1 and 2). Let's say 1 is leading.
      2. Bring down 2.
      3. Bring up 2.
      4. 2 gets a notification from 1 that it is leading, but 2 doesn't accept 1 as the leader because it cannot find one other node that thinks 1 is the leader.

      So the ensemble gets stuck with 2 not following. If 3 comes up at this point, one of 2 and 3 will become leader while 1 keeps thinking it is the leader.

      I am working on a patch to fix this issue.

      1. zk1075.txt
        2 kB
        Vishal Kathuria

        Activity

        vishal.k Vishal Kathuria added a comment -

        The issue, I think, is this code in FastLeaderElection.java:

            /**
             * Before joining an established ensemble, verify that
             * a majority are following the same leader.
             */
            outofelection.put(n.sid, new Vote(n.leader, n.zxid,
                    n.epoch, n.state));
            if (termPredicate(outofelection, new Vote(n.leader,
                    n.zxid, n.epoch, n.state))
                    && checkLeader(outofelection, n.leader, n.epoch)) {

        In the case above, there is only one entry in outofelection, and a single entry does not constitute a majority. What we really need to check is whether outofelection.size() + 1 (counting this server) forms a majority, because once this server accepts the leader, the leader will have a majority of followers.
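
        The arithmetic behind this argument can be sketched in a few lines of Java. This is a minimal illustration, assuming a simple-majority quorum; the class and method names (MajoritySketch, canJoinStrict, canJoinCountingSelf) are hypothetical and are not taken from FastLeaderElection.java or from the attached patch.

```java
// Hypothetical sketch; names are illustrative and not part of ZooKeeper's API.
final class MajoritySketch {

    /** True if `votes` is a strict majority of an ensemble of `ensembleSize` servers. */
    static boolean isMajority(int votes, int ensembleSize) {
        return votes > ensembleSize / 2;
    }

    /** Strict check: only votes already recorded as out of election are counted. */
    static boolean canJoinStrict(int outOfElectionVotes, int ensembleSize) {
        return isMajority(outOfElectionVotes, ensembleSize);
    }

    /** Proposed check: the joining server counts itself as one more supporter. */
    static boolean canJoinCountingSelf(int outOfElectionVotes, int ensembleSize) {
        return isMajority(outOfElectionVotes + 1, ensembleSize);
    }

    public static void main(String[] args) {
        int ensembleSize = 3;       // three-server ensemble from the description
        int outOfElectionVotes = 1; // only server 1 reports that it is leading

        System.out.println("strict check passes: "
                + canJoinStrict(outOfElectionVotes, ensembleSize));
        System.out.println("check counting this server passes: "
                + canJoinCountingSelf(outOfElectionVotes, ensembleSize));
    }
}
```

        With three servers and a single out-of-election vote, the strict check fails because 1 is not more than half of 3, while the check that counts the joining server succeeds because 2 is.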

        vishalmlst Vishal Kher added a comment -

        Hi Vishal,

        Can you please attach logs? Also, it might be good to use the trunk since there have been some relevant code changes (https://issues.apache.org/jira/browse/ZOOKEEPER-975).

        Thanks,
        Vishal

        vishal.k Vishal Kathuria added a comment -

        I am syncing from the trunk and trying it out.
        This issue looks different from the one in 975 - in this issue, we never reach convergence.

        I lost the logs as I reran the server after implementing my fix. I'll upload the logs after rerunning it.

        Thanks!

        fpj Flavio Junqueira added a comment -

        Hi Vishal, In the scenario you're describing, it sounds like 1 will eventually release leadership (it doesn't have enough supporters), and in that case both 1 and 2 are looking for a leader and a leader is elected regularly, no?

        vishal.k Vishal Kathuria added a comment -

        I did not see 1 relinquish leadership when I ran the repro. 1 continued to think it was the leader, 2 continued to refuse to accept that, and they stayed in this state indefinitely (over an hour, I think).

        fpj Flavio Junqueira added a comment -

        Could you post your configuration here, please? It sounds odd that your leader stayed up for over an hour without the support of a quorum. It should have timed out.

        vishal.k Vishal Kathuria added a comment -

        Ah, that makes sense. I was using high timeouts because I was debugging another issue. The problem doesn't occur with low timeouts (and I am embarrassed that didn't occur to me).

        I am attaching the patch anyway in case you folks think it is a nice optimization - this way the node can join right away instead of having to wait for the timeout and start a new election.

        I'll close the Jira.
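
        The effect of high timeouts mentioned in this comment can be illustrated with a little arithmetic. ZooKeeper's server limits such as initLimit and syncLimit are expressed in ticks of tickTime milliseconds, so a limit in wall-clock time is the product of the two. The sketch below only performs that conversion; the class name and the sample values are hypothetical, and the thread does not state which limit applied or what the reporter's configuration was.

```java
// Hypothetical sketch; converts a limit given in ticks to milliseconds.
final class TimeoutSketch {

    /** Wall-clock length, in milliseconds, of a limit expressed in ticks. */
    static long limitMillis(int tickTimeMs, int limitTicks) {
        return (long) tickTimeMs * limitTicks;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the reporter's setup.
        long typical = limitMillis(2000, 5);        // a short limit
        long debugging = limitMillis(600000, 10);   // a limit inflated for debugging

        System.out.println("typical limit (ms): " + typical);
        System.out.println("inflated limit (ms): " + debugging);
    }
}
```

        Under these assumed values, the inflated limit is well over an hour, which is consistent with a leader that appears to hang without a quorum.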

        vishal.k Vishal Kathuria added a comment -

        This is a simple patch to optimize the case where a server that would restore the quorum is coming up.

        Instead of the leader giving up and a new leader election taking place, this change makes the new server join the existing ensemble.

        fpj Flavio Junqueira added a comment -

        Heh, no worries, Vishal. It is good already that you considered contributing.


          People

          • Assignee: Unassigned
          • Reporter: vishal.k Vishal Kathuria
          • Votes: 0
          • Watchers: 4


              Time Tracking

              • Original Estimate: 48h
              • Remaining Estimate: 48h
              • Time Spent: Not Specified
