ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1465

Cluster availability following new leader election takes a long time with large datasets - is correlated to dataset size

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.3
    • Fix Version/s: 3.4.4, 3.5.0
    • Component/s: leaderElection
    • Labels:
      None

      Description

      When re-electing a new leader of a cluster, it takes a long time for the cluster to become available if the dataset is large

      Test Data
      ----------
      650mb snapshot size
      20k nodes of varied size
      3 member cluster

      On 3.4.x branch (http://svn.apache.org/repos/asf/zookeeper/branches/branch-3.4?r=1244779)
      ------------------------------------------------------------------------------------------

      Takes 3-4 minutes to bring up a cluster from cold
      Takes 40-50 secs to recover from a leader failure
      Takes 10 secs for a new follower to join the cluster

      Using the 3.3.5 release on the same hardware with the same dataset
      -----------------------------------------------------------------

      Takes 10-20 secs to bring up a cluster from cold
      Takes 10 secs to recover from a leader failure
      Takes 10 secs for a new follower to join the cluster

      I can see from the logs in 3.4.x that once a new leader is elected, it pushes a new snapshot to each of the followers who need to save it before they ack the leader who can then mark the cluster as available.

      The kit being used is a low spec vm so the times taken are not relevant per se - more the fact that a snapshot is always sent even through there is no difference between the persisted state on each peer.
      No data is being added to the cluster while the peers are being restarted.

      1. ZOOKEEPER-1465.patch
        2 kB
        Camille Fournier
      2. ZOOKEEPER-1465.patch
        6 kB
        Thawan Kooburat
      3. ZOOKEEPER-1465.patch
        13 kB
        Flavio Junqueira
      4. ZOOKEEPER-1465.patch
        13 kB
        Flavio Junqueira
      5. ZOOKEEPER-1465_br34.patch
        14 kB
        Camille Fournier

        Activity

        Hide
        Camille Fournier added a comment -

        When there are no uncommitted proposals and no diff between leader and follower, leader will send a snap due to the fact that this logic:

        if (peerLastZxid == leaderLastZxid) {
        LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}",
        Long.toHexString(leaderLastZxid));
        // We are in sync so we'll do an empty diff
        packetToSend = Leader.DIFF;
        zxidToSend = leaderLastZxid;
        }

        Will never return true. I'm trying to figure out a test for this now.

        Show
        Camille Fournier added a comment - When there are no uncommitted proposals and no diff between leader and follower, leader will send a snap due to the fact that this logic: if (peerLastZxid == leaderLastZxid) { LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}", Long.toHexString(leaderLastZxid)); // We are in sync so we'll do an empty diff packetToSend = Leader.DIFF; zxidToSend = leaderLastZxid; } Will never return true. I'm trying to figure out a test for this now.
        Hide
        Camille Fournier added a comment -

        One basic attempt. The Zab1_0 tests should be changed to reflect that we send DIFFs not SNAP. Might want additional tests though.

        Show
        Camille Fournier added a comment - One basic attempt. The Zab1_0 tests should be changed to reflect that we send DIFFs not SNAP. Might want additional tests though.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12528348/ZOOKEEPER-1465.patch
        against trunk revision 1337029.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12528348/ZOOKEEPER-1465.patch against trunk revision 1337029. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1082//console This message is automatically generated.
        Hide
        Marshall McMullen added a comment -

        How confident are we in the fix for this bug? I'm patching our local zookeeper and we've definitely observed this bug ourselves. Would like to incorporate it into our next patched version but would love some feedback on this fix...Thanks!

        Show
        Marshall McMullen added a comment - How confident are we in the fix for this bug? I'm patching our local zookeeper and we've definitely observed this bug ourselves. Would like to incorporate it into our next patched version but would love some feedback on this fix...Thanks!
        Hide
        Camille Fournier added a comment -

        I think it's pretty decent, but it would be good if someone else double-checked my work. Ben or Flavio maybe, do either of you have time to check this out?
        It'd also be great if you tried running this in staging and see if it helps Marshall, and I would appreciate your eyes on the code if you want to take a look.

        Show
        Camille Fournier added a comment - I think it's pretty decent, but it would be good if someone else double-checked my work. Ben or Flavio maybe, do either of you have time to check this out? It'd also be great if you tried running this in staging and see if it helps Marshall, and I would appreciate your eyes on the code if you want to take a look.
        Hide
        Flavio Junqueira added a comment -

        I'll have a look into it.

        Show
        Flavio Junqueira added a comment - I'll have a look into it.
        Hide
        Thawan Kooburat added a comment -

        Even if there are committed proposal but learner is already in-sync, we still have bug due to this line
        // skip the proposals the peer already has
        if (propose.packet.getZxid() <= peerLastZxid)

        { prevProposalZxid = propose.packet.getZxid(); continue; }

        If this is the last proposal, it will exit the loop and never setup the DIFF packet. So it will send out the entire snapshot.

        This patch seems to work in this case as well.

        Show
        Thawan Kooburat added a comment - Even if there are committed proposal but learner is already in-sync, we still have bug due to this line // skip the proposals the peer already has if (propose.packet.getZxid() <= peerLastZxid) { prevProposalZxid = propose.packet.getZxid(); continue; } If this is the last proposal, it will exit the loop and never setup the DIFF packet. So it will send out the entire snapshot. This patch seems to work in this case as well.
        Hide
        Flavio Junqueira added a comment -

        Here is a summary of my progress so far. The patch essentially proposes this change:

        -                if (peerLastZxid == leaderLastZxid) {
        +                if (peerLastZxid == leaderLastZxid || peerLastZxid == maxCommittedLog) {
        

        This tells me that the problem is that we need to set packetToSend to Leader.DIFF when "peerLastZxid == maxCommittedLog", but "peerLastZxid == leaderLastZxid" does not hold and packetToSend is set to Leader.SNAP. If peerLastZxid <> leaderLastZxid, then the leader possibly has proposals that the learner doesn't, but I can't convince myself that it must be the case. If it does have proposals that the learner doesn't, then packetToSend should have been set to DIFF in the previous if/for block, no?

        I'll keep looking, but if anyone sees a flaw here or has extra information that might help, please let me know.

        Show
        Flavio Junqueira added a comment - Here is a summary of my progress so far. The patch essentially proposes this change: - if (peerLastZxid == leaderLastZxid) { + if (peerLastZxid == leaderLastZxid || peerLastZxid == maxCommittedLog) { This tells me that the problem is that we need to set packetToSend to Leader.DIFF when "peerLastZxid == maxCommittedLog", but "peerLastZxid == leaderLastZxid" does not hold and packetToSend is set to Leader.SNAP. If peerLastZxid <> leaderLastZxid, then the leader possibly has proposals that the learner doesn't, but I can't convince myself that it must be the case. If it does have proposals that the learner doesn't, then packetToSend should have been set to DIFF in the previous if/for block, no? I'll keep looking, but if anyone sees a flaw here or has extra information that might help, please let me know.
        Hide
        Thawan Kooburat added a comment -

        Here is my current understanding of the problem. Essentially, the leader uses 3 lists to sync with follower: committedLog, toBeApplied and outstandingProposals.

        What I believe that the existing logic intend to do is that, if follower missed the committedLog (either committedLog is empty or peerLastZxid is not in range), then we send snapshot and follow by transactions in toBeApplied and outstandingProposals.

        However, the problem that we see here is that the follower doesn’t missed the committedLog, but the logic below fail to setup DIFF packet correctly because peerLastZxid == maxCommittedLog (last element in commitedLog).

                                for (Proposal propose: proposals) {
                                    // skip the proposals the peer already has
                                    if (propose.packet.getZxid() <= peerLastZxid) {
                                        prevProposalZxid = propose.packet.getZxid();
                                        continue;
                                    } else {
                                        // If we are sending the first packet, figure out whether to trunc
                                        // in case the follower has some proposals that the leader doesn't
                                        if (firstPacket) {
                                            firstPacket = false;
                                            // Does the peer have some proposals that the leader hasn't seen yet
                                            if (prevProposalZxid < peerLastZxid) {
                                                // send a trunc message before sending the diff
                                                packetToSend = Leader.TRUNC;
                                                LOG.info("Sending TRUNC");
                                                zxidToSend = prevProposalZxid;
                                                updates = zxidToSend;
                                            }
                                            else {
                                                // Just send the diff
                                                packetToSend = Leader.DIFF;
                                                LOG.info("Sending diff");
                                                zxidToSend = maxCommittedLog;
                                            }
        
                                        }
                                        queuePacket(propose.packet);
                                        QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
                                                null, null);
                                        queuePacket(qcommit);
                                    }
                                }
        

        I believe that if we fix the problem at the root cause, we can remove the code below completely since deciding whether to send DIFF packet is based on the fact that follower miss the committedLog or not. The startForwarding() method should handle inflight transactions correctly.

                        if (peerLastZxid == leaderLastZxid) {
                            LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}",
                                    Long.toHexString(leaderLastZxid));
                            // We are in sync so we'll do an empty diff
                            packetToSend = Leader.DIFF;
                            zxidToSend = leaderLastZxid;
                        }
        

        The proposed fix minimizes code changes, but should we fix the problem at the root?

        Show
        Thawan Kooburat added a comment - Here is my current understanding of the problem. Essentially, the leader uses 3 lists to sync with follower: committedLog, toBeApplied and outstandingProposals. What I believe that the existing logic intend to do is that, if follower missed the committedLog (either committedLog is empty or peerLastZxid is not in range), then we send snapshot and follow by transactions in toBeApplied and outstandingProposals. However, the problem that we see here is that the follower doesn’t missed the committedLog, but the logic below fail to setup DIFF packet correctly because peerLastZxid == maxCommittedLog (last element in commitedLog). for (Proposal propose: proposals) { // skip the proposals the peer already has if (propose.packet.getZxid() <= peerLastZxid) { prevProposalZxid = propose.packet.getZxid(); continue; } else { // If we are sending the first packet, figure out whether to trunc // in case the follower has some proposals that the leader doesn't if (firstPacket) { firstPacket = false; // Does the peer have some proposals that the leader hasn't seen yet if (prevProposalZxid < peerLastZxid) { // send a trunc message before sending the diff packetToSend = Leader.TRUNC; LOG.info("Sending TRUNC"); zxidToSend = prevProposalZxid; updates = zxidToSend; } else { // Just send the diff packetToSend = Leader.DIFF; LOG.info("Sending diff"); zxidToSend = maxCommittedLog; } } queuePacket(propose.packet); QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(), null, null); queuePacket(qcommit); } } I believe that if we fix the problem at the root cause, we can remove the code below completely since deciding whether to send DIFF packet is based on the fact that follower miss the committedLog or not. The startForwarding() method should handle inflight transactions correctly. if (peerLastZxid == leaderLastZxid) { LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}", Long.toHexString(leaderLastZxid)); // We are in sync so we'll do an empty diff packetToSend = Leader.DIFF; zxidToSend = leaderLastZxid; } The proposed fix minimizes code changes, but should we fix the problem at the root?
        Hide
        Flavio Junqueira added a comment -

        I was wondering the same thing, if that "if" block is needed in the case the decision logic for SNAP, DIFF, TRUNC is correct. Once we determine that "(maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid)" holds, then we know that it is either DIFF or TRUNC, so we should not leave that if block with packetToSend = Leader.SNAP. If we fix that, I believe that we will be fixing the issue of this jira. Does it sound right?

        I also trying to think if there is a case in which "proposals.size() == 0" and we end up sending a snapshot unnecessarily. "proposals" will be empty if the log is empty, but I think we could have the situation in which the leader and the follower have the same snapshot, but the transaction log of the leader has been pruned. Does it make sense? If it does, then we need to keep that last block.

        Show
        Flavio Junqueira added a comment - I was wondering the same thing, if that "if" block is needed in the case the decision logic for SNAP, DIFF, TRUNC is correct. Once we determine that "(maxCommittedLog >= peerLastZxid) && (minCommittedLog <= peerLastZxid)" holds, then we know that it is either DIFF or TRUNC, so we should not leave that if block with packetToSend = Leader.SNAP. If we fix that, I believe that we will be fixing the issue of this jira. Does it sound right? I also trying to think if there is a case in which "proposals.size() == 0" and we end up sending a snapshot unnecessarily. "proposals" will be empty if the log is empty, but I think we could have the situation in which the leader and the follower have the same snapshot, but the transaction log of the leader has been pruned. Does it make sense? If it does, then we need to keep that last block.
        Hide
        Flavio Junqueira added a comment -

        More about "proposals" being empty. It sounds like it can happen if dt.lastProcessedZxid is the last logged transaction. In this case, "listener.onTxnLoaded(hdr, itr.getTxn())" won't be called in "FileTxnSnapLog.restore()" and "ZKDatabase.addCommittedProposal()" won't be invoked, leaving the committedLog empty upon a restore. Recall that "proposals = leader.zk.getZKDatabase().getCommittedLog()".

        If this is right, then I'm actually thinking that there might be a corner case to this change:

        +                if (peerLastZxid == leaderLastZxid || peerLastZxid == maxCommittedLog) {
        

        Suppose that:

        1. peerLastZxid == 0;
        2. committed log is empty (maxCommittedLog==0);
        3. leaderLastZxid > 0 and leader has a snapshot to send.

        If this case can happen, then the leader would fail to send the snapshot. Do you see a problem with the scenario I'm suggesting, Thawan, Camille?

        Show
        Flavio Junqueira added a comment - More about "proposals" being empty. It sounds like it can happen if dt.lastProcessedZxid is the last logged transaction. In this case, "listener.onTxnLoaded(hdr, itr.getTxn())" won't be called in "FileTxnSnapLog.restore()" and "ZKDatabase.addCommittedProposal()" won't be invoked, leaving the committedLog empty upon a restore. Recall that "proposals = leader.zk.getZKDatabase().getCommittedLog()". If this is right, then I'm actually thinking that there might be a corner case to this change: + if (peerLastZxid == leaderLastZxid || peerLastZxid == maxCommittedLog) { Suppose that: peerLastZxid == 0; committed log is empty (maxCommittedLog==0); leaderLastZxid > 0 and leader has a snapshot to send. If this case can happen, then the leader would fail to send the snapshot. Do you see a problem with the scenario I'm suggesting, Thawan, Camille?
        Hide
        Camille Fournier added a comment -

        Thawan, that's a good idea. It's not a bigger change to change the packetToSend=DIFF when we enter the if((maxCommittedLog >=peerLastZxid) && (minCommittedLog <= peerLastZxid)) instead of putting the hack where I stuck it. Do you want to modify my patch and see if it still works?

        Flavio, I'm not sure how we can have a snapshot to send when we don't have a committedLog. Can you clarify? Either the requests are committed, or they should be in the toBeApplied list on the Leader and so the peer will get them when startForwarding is called with the peer's last zxid.

        Show
        Camille Fournier added a comment - Thawan, that's a good idea. It's not a bigger change to change the packetToSend=DIFF when we enter the if((maxCommittedLog >=peerLastZxid) && (minCommittedLog <= peerLastZxid)) instead of putting the hack where I stuck it. Do you want to modify my patch and see if it still works? Flavio, I'm not sure how we can have a snapshot to send when we don't have a committedLog. Can you clarify? Either the requests are committed, or they should be in the toBeApplied list on the Leader and so the peer will get them when startForwarding is called with the peer's last zxid.
        Hide
        Flavio Junqueira added a comment -

        I must say that I haven't been able to convince myself completely that it can happen, but this is how I see it at the moment. The leader restores its database from a snapshot and the last transaction in the log is the same as the last processed, so committedLog is empty. If there are operations in flight when the learner handler code we are talking about is executed, then the ones in toBeApplied will be sent, but the learner might still need the snapshot.

        Show
        Flavio Junqueira added a comment - I must say that I haven't been able to convince myself completely that it can happen, but this is how I see it at the moment. The leader restores its database from a snapshot and the last transaction in the log is the same as the last processed, so committedLog is empty. If there are operations in flight when the learner handler code we are talking about is executed, then the ones in toBeApplied will be sent, but the learner might still need the snapshot.
        Hide
        Thawan Kooburat added a comment -

        My version of the patch

        Show
        Thawan Kooburat added a comment - My version of the patch
        Hide
        Thawan Kooburat added a comment -

        This introduces a lot more changes. I found that there are 2 conditions that we will incorrectly send a snap even though it is not necessary

        1. lastPeerZxid == maxCommittedLog
        2. committedLog is empty but both leader and follower db is already sync. This can happen when the leader restart after it just take a snapshot. Easy case to produce this is to restart the quorum 2 times in row.

        I am also working a patch ZOOKEEPER-1413 that will allow the leader to look back into on-disk txnlog, so that the need for doing snapshot transfer is further reduced.

        Show
        Thawan Kooburat added a comment - This introduces a lot more changes. I found that there are 2 conditions that we will incorrectly send a snap even though it is not necessary 1. lastPeerZxid == maxCommittedLog 2. committedLog is empty but both leader and follower db is already sync. This can happen when the leader restart after it just take a snapshot. Easy case to produce this is to restart the quorum 2 times in row. I am also working a patch ZOOKEEPER-1413 that will allow the leader to look back into on-disk txnlog, so that the need for doing snapshot transfer is further reduced.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12530600/ZOOKEEPER-1465.patch
        against trunk revision 1337029.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12530600/ZOOKEEPER-1465.patch against trunk revision 1337029. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1085//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Thanks, it looks good in general, Thawan. One question, if you remove this "if (peerLastZxid == leaderLastZxid)" block, then are we going to be sending a snapshot in the case that committedLog is empty (proposals is empty) and peerLastZxid == leaderLastZxid? leaderLastZxid is not necessarily equal to leader.zk.getZKDatabase().getDataTreeLastProcessedZxid().

        2. committedLog is empty but both leader and follower db is already sync. This can happen when the leader restart after it just take a snapshot. Easy case to produce this is to restart the quorum 2 times in row.

        This case is essentially the same I was pointing out above. if it is simple to reproduce, it might be a good idea to have a test.

        Show
        Flavio Junqueira added a comment - Thanks, it looks good in general, Thawan. One question, if you remove this "if (peerLastZxid == leaderLastZxid)" block, then are we going to be sending a snapshot in the case that committedLog is empty (proposals is empty) and peerLastZxid == leaderLastZxid? leaderLastZxid is not necessarily equal to leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(). 2. committedLog is empty but both leader and follower db is already sync. This can happen when the leader restart after it just take a snapshot. Easy case to produce this is to restart the quorum 2 times in row. This case is essentially the same I was pointing out above. if it is simple to reproduce, it might be a good idea to have a test.
        Hide
        Flavio Junqueira added a comment -

        Cancelling until we sort out comments.

        Show
        Flavio Junqueira added a comment - Cancelling until we sort out comments.
        Hide
        Thawan Kooburat added a comment -

        This is my current understanding:
        leaderLastZxid is essentially lastProposed on the leader. (peerLastZxid == leaderLastZxid) will be true only if there is no in-flight txn on the leader when follower is syncing.

        As I explained earlier, the decision to decided whether to send a DIFF is based solely if follower miss the committedLog or not. Other case is just a special case that should be covered if we got that logic right. Please correct me if this assumption is not true.

        On the other hand, when committedLog is empty there is nothing we can do except the special case that I added in (2)

        Show
        Thawan Kooburat added a comment - This is my current understanding: leaderLastZxid is essentially lastProposed on the leader. (peerLastZxid == leaderLastZxid) will be true only if there is no in-flight txn on the leader when follower is syncing. As I explained earlier, the decision to decided whether to send a DIFF is based solely if follower miss the committedLog or not. Other case is just a special case that should be covered if we got that logic right. Please correct me if this assumption is not true. On the other hand, when committedLog is empty there is nothing we can do except the special case that I added in (2)
        Hide
        Flavio Junqueira added a comment - - edited

        I agree with this case, but I'm asking about this one:

        1. committedLog is empty
        2. leaderLastZxid > leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()

        leaderLastZxid is lastProposed as you say, but it can be greater than leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(), right?

        Show
        Flavio Junqueira added a comment - - edited I agree with this case, but I'm asking about this one: committedLog is empty leaderLastZxid > leader.zk.getZKDatabase().getDataTreeLastProcessedZxid() leaderLastZxid is lastProposed as you say, but it can be greater than leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(), right?
        Hide
        Flavio Junqueira added a comment -

        About my previous comment, even though leaderLastZxid can be greater than leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(), in the case that leaderLastZxid == peerLastZxid, it must be the case that leaderLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid().

        I believe we have this invariant that it can't be that "leaderLastZxid >= peerLastZxid > leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()". The case that "leaderLastZxid >= peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()" is allowed, though, but already covered in your patch.

        As I explained earlier, the decision to decided whether to send a DIFF is based solely if follower miss the committedLog or not. Other case is just a special case that should be covered if we got that logic right. Please correct me if this assumption is not true.

        The assumption is correct.

        About a test for your second case, do you think you can work on one? I can try to think of one as well.

        Show
        Flavio Junqueira added a comment - About my previous comment, even though leaderLastZxid can be greater than leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(), in the case that leaderLastZxid == peerLastZxid, it must be the case that leaderLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid(). I believe we have this invariant that it can't be that "leaderLastZxid >= peerLastZxid > leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()". The case that "leaderLastZxid >= peerLastZxid == leader.zk.getZKDatabase().getDataTreeLastProcessedZxid()" is allowed, though, but already covered in your patch. As I explained earlier, the decision to decided whether to send a DIFF is based solely if follower miss the committedLog or not. Other case is just a special case that should be covered if we got that logic right. Please correct me if this assumption is not true. The assumption is correct. About a test for your second case, do you think you can work on one? I can try to think of one as well.
        Hide
        Thawan Kooburat added a comment -

        Please feel free to add a test if you have time to work on it

        Show
        Thawan Kooburat added a comment - Please feel free to add a test if you have time to work on it
        Hide
        Flavio Junqueira added a comment -

        New patch adds a new test case as dicussed.

        Show
        Flavio Junqueira added a comment - New patch adds a new test case as dicussed.
        Hide
        Flavio Junqueira added a comment -

        I've added a test case to reproduce the case of commitedLog being empty and follower and leader being in sync. Please have a look and let me know if it works for you.

        Show
        Flavio Junqueira added a comment - I've added a test case to reproduce the case of commitedLog being empty and follower and leader being in sync. Please have a look and let me know if it works for you.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532186/ZOOKEEPER-1465.patch
        against trunk revision 1337029.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532186/ZOOKEEPER-1465.patch against trunk revision 1337029. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1096//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Improvements over the previous patch. The essence is still the same, though. I have simply generalized the method to populate the database.

        Show
        Flavio Junqueira added a comment - Improvements over the previous patch. The essence is still the same, though. I have simply generalized the method to populate the database.
        Hide
        Flavio Junqueira added a comment -

        Rerun Jenkins...

        Show
        Flavio Junqueira added a comment - Rerun Jenkins...
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12532193/ZOOKEEPER-1465.patch
        against trunk revision 1337029.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12532193/ZOOKEEPER-1465.patch against trunk revision 1337029. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1097//console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        Where are we with this? Consensus is that it's ready or should we push it out to 3.4.5 to allow it to bake more? Anyone has tested this to any extent outside of the unit tests?

        Show
        Patrick Hunt added a comment - Where are we with this? Consensus is that it's ready or should we push it out to 3.4.5 to allow it to bake more? Anyone has tested this to any extent outside of the unit tests?
        Hide
        Flavio Junqueira added a comment -

        I'm not aware of anyone trying this in a real setting, but the issue is legitimate and we have verified both through reasoning and tests. It is +1 for me.

        Show
        Flavio Junqueira added a comment - I'm not aware of anyone trying this in a real setting, but the issue is legitimate and we have verified both through reasoning and tests. It is +1 for me.
        Hide
        Thawan Kooburat added a comment -

        lastProcessedZxid==peerZxid is quite a common issue that we see in our production. We are planning to deploy a variant of this patch in our production soon.

        Show
        Thawan Kooburat added a comment - lastProcessedZxid==peerZxid is quite a common issue that we see in our production. We are planning to deploy a variant of this patch in our production soon.
        Hide
        Flavio Junqueira added a comment -

        Thawan, Out of curiosity, why a variant and not this patch?

        Show
        Flavio Junqueira added a comment - Thawan, Out of curiosity, why a variant and not this patch?
        Hide
        Thawan Kooburat added a comment -

        Our internal version also use on-disk txnlog together with committedLog to sync with follower. So the logic for learner handler is our version is a bit different.

        Show
        Thawan Kooburat added a comment - Our internal version also use on-disk txnlog together with committedLog to sync with follower. So the logic for learner handler is our version is a bit different.
        Hide
        Camille Fournier added a comment -

        Now that I can finally get onto jira I'm going to run this in a small testing env I have here to make sure it looks good.

        Show
        Camille Fournier added a comment - Now that I can finally get onto jira I'm going to run this in a small testing env I have here to make sure it looks good.
        Hide
        Camille Fournier added a comment -

        So I'm seeing a log message that one of the followers is snapshotting twice in rapid succession, and I'm worried that it's caused by this patch. Tracking it down now.

        Show
        Camille Fournier added a comment - So I'm seeing a log message that one of the followers is snapshotting twice in rapid succession, and I'm worried that it's caused by this patch. Tracking it down now.
        Hide
        Camille Fournier added a comment -

        OK this appears to be an unrelated minor bug in the new Zab1.0 logic. Other than that, my testing looks good for this patch.

        Show
        Camille Fournier added a comment - OK this appears to be an unrelated minor bug in the new Zab1.0 logic. Other than that, my testing looks good for this patch.
        Hide
        Patrick Hunt added a comment -

        Camille you tried this with trunk or branch3.4? Could you verify it against branch3.4 as well? Thanks!

        Show
        Patrick Hunt added a comment - Camille you tried this with trunk or branch3.4? Could you verify it against branch3.4 as well? Thanks!
        Hide
        Camille Fournier added a comment -

        Actually just tried it with 3.4, didn't try trunk at all.

        Show
        Camille Fournier added a comment - Actually just tried it with 3.4, didn't try trunk at all.
        Hide
        Camille Fournier added a comment -

        OK I tried this with both trunk and 3.4 and it's looking pretty good to me. I am going to check this in today.

        Show
        Camille Fournier added a comment - OK I tried this with both trunk and 3.4 and it's looking pretty good to me. I am going to check this in today.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1604 (See https://builds.apache.org/job/ZooKeeper-trunk/1604/)
        ZOOKEEPER-1465. Cluster availability following new leader election
        takes a long time with large datasets - is correlated to dataset size
        (fpj and Thawan Kooburat via camille) (Revision 1357711)

        Result = SUCCESS
        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1357711
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1604 (See https://builds.apache.org/job/ZooKeeper-trunk/1604/ ) ZOOKEEPER-1465 . Cluster availability following new leader election takes a long time with large datasets - is correlated to dataset size (fpj and Thawan Kooburat via camille) (Revision 1357711) Result = SUCCESS camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1357711 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        Hide
        Camille Fournier added a comment -

        Checked in to trunk and 3.4.

        Show
        Camille Fournier added a comment - Checked in to trunk and 3.4.
        Hide
        Camille Fournier added a comment -

        patch from the 3.4 branch

        Show
        Camille Fournier added a comment - patch from the 3.4 branch

          People

          • Assignee:
            Camille Fournier
            Reporter:
            Alex Gvozdenovic
          • Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development