ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-962

leader/follower coherence issue when follower is receiving a DIFF

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.2
    • Fix Version/s: 3.3.3, 3.4.0
    • Component/s: server
    • Labels:
      None

      Description

      From mailing list:
      It seems like we rely on the LearnerHandler thread startup to capture all of the missing committed
      transactions in the SNAP or DIFF, but I don't see anything (especially in the DIFF case) that
      is preventing us for committing more transactions before we actually start forwarding updates
      to the new follower.

      Let me explain using my example from ZOOKEEPER-919. Assume we have quorum already, so the
      leader can be processing transactions while my follower is starting up.

      I'm a follower at zxid N-5, the leader is at N. I send my FOLLOWERINFO packet to the leader
      with that information. The leader gets the proposals from its committed log (time T1), then
      syncs on the proposal list (LearnerHandler line 267. Why? It's a copy of the underlying proposal
      list... this might be part of our problem). I check to see if the peerLastZxid is within my
      max and min committed log and it is, so I'm going to send a diff. I set the zxidToSend to
      be the maxCommittedLog at time T3 (we already know this is sketchy), and forward the proposals
      from my copied proposal list starting at the peerLastZxid+1 up to the last proposal transaction
      (as seen at time T1).

      After I have queued up all those diffs to send, I tell the leader to startFowarding updates
      to this follower (line 308).

      So, let's say that at time T2 I actually swap out the leader to the thread that is handling
      the various request processors, and see that I got enough votes to commit zxid N+1. I commit
      N+1 and so my maxCommittedLog at T3 is N+1, but this proposal is not in the list of proposals
      that I got back at time T1, so I don't forward this diff to the client. Additionally, I processed
      the commit and removed it from my leader's toBeApplied list. So when I call startForwarding
      for this new follower, I don't see this transaction as a transaction to be forwarded.

      There's one problem. Let's also imagine, however, that I commit N+1 at time T4. The maxCommittedLog
      value is consistent with the max of the diff packets I am going to send the follower. But,
      I still committed N+1 and removed it from the toBeApplied list before calling startFowarding
      with this follower. How does the follower get this transaction? Does it?

      To put it another way, here is the thread interaction, hopefully formatted so you can read
      it...

      LearnerHandlerThread RequestProcessorThread
      T1(LH): get list of proposals (COPY)
      T2(RPT): commit N+1, remove from toBeApplied
      T3(LH): get maxCommittedLog
      T4(LH): send diffs from view at T1
      T5(LH): startForwarding

      Or
      T1(LH): get list of proposals (COPY)
      T2(LH): get maxCommittedLog
      T3(RPT): commit N+1, remove from toBeApplied
      T4(LH): send diffs from view at T1
      T5(LH): startFowarding

      I'm trying to figure out what, if anything, keeps the requests from being committed, removed,
      and never seen by the follower before it fully starts up.

      1. ZOOKEEPER-962_2.patch
        23 kB
        Benjamin Reed
      2. ZOOKEEPER-962_3.patch
        30 kB
        Camille Fournier
      3. ZOOKEEPER-962_4.patch
        39 kB
        Benjamin Reed
      4. ZOOKEEPER-962_5.patch
        34 kB
        Camille Fournier
      5. ZOOKEEPER-962_6.patch
        42 kB
        Benjamin Reed
      6. ZOOKEEPER-962.patch
        29 kB
        Camille Fournier

        Issue Links

          Activity

          Hide
          Mahadev konar added a comment -

          this should go into 3.3.

          Show
          Mahadev konar added a comment - this should go into 3.3.
          Hide
          Vishal Kher added a comment -

          Hi,

          May I please request an expected date for this fix? We are working on releasing our product in a month and I think this is a serious enough bug that might block our release. It will be very helpful to know when the fix is coming.

          Thanks for your help.

          Show
          Vishal Kher added a comment - Hi, May I please request an expected date for this fix? We are working on releasing our product in a month and I think this is a serious enough bug that might block our release. It will be very helpful to know when the fix is coming. Thanks for your help.
          Hide
          Camille Fournier added a comment -

          I think the fix for this will be twofold:
          First, I think the work that is currently done in LearnerHandler.run to set up packets to forward or not, should be done in leader.startForwarding. This method is synchronized with processAck, so we can feel confident that no new committed proposals should be written to the zk database while we are forwarding outstanding requests to the new follower[1]. We also need to add a proper lock for the committed log, so that we can ensure proper behavior for the proposal list processing for DIFFs.

          In the follower UPTODATE handling, I'm going to try checking for fzk.commitProcessor.committedRequest.isEmpty(), and yield until it is true, before calling fzk.takeSnapshot().

          I'm working on a patch for this now, let me know if you think this approach sounds fishy.

          [1] Still figuring out the sync issues when we have already received an ack and are in the commitProcessor/etc thread stack and simultaneously processing a startForwarding request.

          Show
          Camille Fournier added a comment - I think the fix for this will be twofold: First, I think the work that is currently done in LearnerHandler.run to set up packets to forward or not, should be done in leader.startForwarding. This method is synchronized with processAck, so we can feel confident that no new committed proposals should be written to the zk database while we are forwarding outstanding requests to the new follower [1] . We also need to add a proper lock for the committed log, so that we can ensure proper behavior for the proposal list processing for DIFFs. In the follower UPTODATE handling, I'm going to try checking for fzk.commitProcessor.committedRequest.isEmpty(), and yield until it is true, before calling fzk.takeSnapshot(). I'm working on a patch for this now, let me know if you think this approach sounds fishy. [1] Still figuring out the sync issues when we have already received an ack and are in the commitProcessor/etc thread stack and simultaneously processing a startForwarding request.
          Hide
          Camille Fournier added a comment -

          I think in my fixing/testing of this I have hit pretty much every single possible tiny threading bug in the follower resync code, including [1] above (ouch). I am almost done, pending patch approval for release, and I will hopefully have a patch by the end of today.

          Show
          Camille Fournier added a comment - I think in my fixing/testing of this I have hit pretty much every single possible tiny threading bug in the follower resync code, including [1] above (ouch). I am almost done, pending patch approval for release, and I will hopefully have a patch by the end of today.
          Hide
          Benjamin Reed added a comment -

          that's great camille. i don't think the strategy of waiting for committedRequest to be empty will work because it may never be empty: you may have a steady stream of commits coming in.

          i was thinking perhaps we should process the proposals synchronously in the syncWithLeader method and only exit once we get the update. i think this will take care of 919 as well.

          the easiest idea would be to apply the transactions to memory, and then on the UPDATE message, take a snapshot to disk synchronously and ack all of the previous transactions received and then exit the method.

          have you been able to generate a testcase for the problem?

          Show
          Benjamin Reed added a comment - that's great camille. i don't think the strategy of waiting for committedRequest to be empty will work because it may never be empty: you may have a steady stream of commits coming in. i was thinking perhaps we should process the proposals synchronously in the syncWithLeader method and only exit once we get the update. i think this will take care of 919 as well. the easiest idea would be to apply the transactions to memory, and then on the UPDATE message, take a snapshot to disk synchronously and ack all of the previous transactions received and then exit the method. have you been able to generate a testcase for the problem?
          Hide
          Camille Fournier added a comment -

          Waiting for committedRequests to be empty gives correct behavior (actually, you have to wait for all the queues in commit processor to be empty) , the thread that sends commits to the follower is the same as the thread that sends UPTODATE and the same thread processes commit requests as the UPTODATE. But as soon as I can give you this patch, you'll have to take a look and tell me what you think.

          I have testcases, yes. They are painful and due to the nature of the bug they don't always fail even if the bug exists, but I've run them enough locally to be relatively confident that they cover the error well enough that my fixes are at least correct, if not perfect. My actual fixes may not be ideal, but at least the tests should allow people to try other approaches and see how they work.

          Show
          Camille Fournier added a comment - Waiting for committedRequests to be empty gives correct behavior (actually, you have to wait for all the queues in commit processor to be empty) , the thread that sends commits to the follower is the same as the thread that sends UPTODATE and the same thread processes commit requests as the UPTODATE. But as soon as I can give you this patch, you'll have to take a look and tell me what you think. I have testcases, yes. They are painful and due to the nature of the bug they don't always fail even if the bug exists, but I've run them enough locally to be relatively confident that they cover the error well enough that my fixes are at least correct, if not perfect. My actual fixes may not be ideal, but at least the tests should allow people to try other approaches and see how they work.
          Hide
          Camille Fournier added a comment -

          Patch to fix this

          Show
          Camille Fournier added a comment - Patch to fix this
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467594/ZOOKEEPER-962.patch
          against trunk revision 1053497.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings).

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//testReport/
          Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12467594/ZOOKEEPER-962.patch against trunk revision 1053497. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings. -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//testReport/ Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/81//console This message is automatically generated.
          Hide
          Camille Fournier added a comment -

          fix accidental deletion from quorumutil

          Show
          Camille Fournier added a comment - fix accidental deletion from quorumutil
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12467595/ZOOKEEPER-962.patch
          against trunk revision 1053497.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//testReport/
          Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12467595/ZOOKEEPER-962.patch against trunk revision 1053497. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 10 new Findbugs (version 1.3.9) warnings. -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//testReport/ Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/82//console This message is automatically generated.
          Hide
          Benjamin Reed added a comment -

          i gave reviewboard a try: https://reviews.apache.org/r/264/ there are a couple of minor points there
          one thing i'm not quite clear on is how moving to a read/write lock helps.

          i think there is another problem that i think is more related to 919, but since both issues hit them same piece of code, it would probably be good to fix them together. in the case of SNAP, we send a snapshot and then we send diffs to that snapshot, but imagine the following: we start taking a snapshot, S, and while we take the snapshot Z1, Z2, Z3, and Z4 come in; let's say that the snapshot finishes right after Z4. the follower will get S, and then it will get Z1, which it will log. if the follower, fails and comes back up, it will believe that it has the state up to Z1, which it does not infact have.

          to avoid this problem, i think we should receive S into memory, save off Z1-Z4 into a log in memory, committing them to S if we get a commit, and then when we get an UPTODATE we write out the snapshot and Z1-Z4, and ack Z1-Z4. we can do all this in syncWithLeader and not only the issue i'm pointing out here will be fixed, but also the concurrency issue that you have identified here will go away.

          what do you think?

          Show
          Benjamin Reed added a comment - i gave reviewboard a try: https://reviews.apache.org/r/264/ there are a couple of minor points there one thing i'm not quite clear on is how moving to a read/write lock helps. i think there is another problem that i think is more related to 919, but since both issues hit them same piece of code, it would probably be good to fix them together. in the case of SNAP, we send a snapshot and then we send diffs to that snapshot, but imagine the following: we start taking a snapshot, S, and while we take the snapshot Z1, Z2, Z3, and Z4 come in; let's say that the snapshot finishes right after Z4. the follower will get S, and then it will get Z1, which it will log. if the follower, fails and comes back up, it will believe that it has the state up to Z1, which it does not infact have. to avoid this problem, i think we should receive S into memory, save off Z1-Z4 into a log in memory, committing them to S if we get a commit, and then when we get an UPTODATE we write out the snapshot and Z1-Z4, and ack Z1-Z4. we can do all this in syncWithLeader and not only the issue i'm pointing out here will be fixed, but also the concurrency issue that you have identified here will go away. what do you think?
          Hide
          Camille Fournier added a comment -

          I made a reviewboard for this myself acutally, https://reviews.apache.org/r/253/, I guess it didn't get emailed to the group properly? I'll cancle it and use yours since we've already got comments in there.

          Let me see if I understand what you're getting at:
          We see snap to Z0. We start to see the toBeApplied proposals being forwarded, and we write them to the log, but before we get the UPTODATE (and thus write the snap file to disk), we crash. So we have a log file with Z1, but no snap file with the older data.

          Yes, you're right, that's a nasty problem.

          Are you proposing for all packets to UPTODATE in all syncWithLeader scenarios, we process them inside SyncWithLeader, or just the ones for snap?

          I think I can twiddle my test to catch this error, let me look at it. I'm not sure how much time I will have to actually make the fix you are proposing, though, so if you have time to try to add it onto my patch please let me know.

          Show
          Camille Fournier added a comment - I made a reviewboard for this myself acutally, https://reviews.apache.org/r/253/ , I guess it didn't get emailed to the group properly? I'll cancle it and use yours since we've already got comments in there. Let me see if I understand what you're getting at: We see snap to Z0. We start to see the toBeApplied proposals being forwarded, and we write them to the log, but before we get the UPTODATE (and thus write the snap file to disk), we crash. So we have a log file with Z1, but no snap file with the older data. Yes, you're right, that's a nasty problem. Are you proposing for all packets to UPTODATE in all syncWithLeader scenarios, we process them inside SyncWithLeader, or just the ones for snap? I think I can twiddle my test to catch this error, let me look at it. I'm not sure how much time I will have to actually make the fix you are proposing, though, so if you have time to try to add it onto my patch please let me know.
          Hide
          Camille Fournier added a comment -

          I may have overestimated my ability to write even an occasionally-failing test on this one.

          Do you think instead of trying to process all those PROPOSE/COMMITs inside the syncWithLeader loop we should just have the follower take a snapshot immediately after receiving the snapshot in syncWithLeader? That would probably be much easier than trying to do an in-memory log in syncWithLeader and should still provide the same guarantee, at the cost of doing a second snapshot soon after the first.

          Show
          Camille Fournier added a comment - I may have overestimated my ability to write even an occasionally-failing test on this one. Do you think instead of trying to process all those PROPOSE/COMMITs inside the syncWithLeader loop we should just have the follower take a snapshot immediately after receiving the snapshot in syncWithLeader? That would probably be much easier than trying to do an in-memory log in syncWithLeader and should still provide the same guarantee, at the cost of doing a second snapshot soon after the first.
          Hide
          Benjamin Reed added a comment -

          yes, you understand the scenario correctly. it is a tricky error case to test for. (i've been trying to write a test case

          we don't really have to do an in-memory log in syncWithLeader, we can actually just apply the the transactions to the in-memory database, ack when we write out the database. hopefully i'll have time today to modify your patch.

          Show
          Benjamin Reed added a comment - yes, you understand the scenario correctly. it is a tricky error case to test for. (i've been trying to write a test case we don't really have to do an in-memory log in syncWithLeader, we can actually just apply the the transactions to the in-memory database, ack when we write out the database. hopefully i'll have time today to modify your patch.
          Hide
          Benjamin Reed added a comment -

          here is an updated patch. it does the initial processing in the startup thread before anything else is started up so that it avoids any race conditions.

          i had to revert the part that moves everything to startFollowing because it does a heavy weight operation (snapshot) in a synchronized block that will hang the server if the database is very big. we still need to reexamine the locking.

          Show
          Benjamin Reed added a comment - here is an updated patch. it does the initial processing in the startup thread before anything else is started up so that it avoids any race conditions. i had to revert the part that moves everything to startFollowing because it does a heavy weight operation (snapshot) in a synchronized block that will hang the server if the database is very big. we still need to reexamine the locking.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468113/ZOOKEEPER-962_2.patch
          against trunk revision 1055924.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//testReport/
          Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12468113/ZOOKEEPER-962_2.patch against trunk revision 1055924. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//testReport/ Release audit warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/86//console This message is automatically generated.
          Hide
          Camille Fournier added a comment -

          This patch breaks in my "concurrency test suite" which is just a suite that runs my FollowerResyncConcurrencyTest over and over again. I would expect it to break because if we don't sync the DIFF packet forwarding with the processing of acks we can remove acks from the toBeApplied list between when we forward DIFF packets and when we call startForwarding. I knew the synching was possibly excessive in the first version (looking back I'm not actually remembering why I decided to sync all of the calls together), but we need at least some additional synchronization.

          Show
          Camille Fournier added a comment - This patch breaks in my "concurrency test suite" which is just a suite that runs my FollowerResyncConcurrencyTest over and over again. I would expect it to break because if we don't sync the DIFF packet forwarding with the processing of acks we can remove acks from the toBeApplied list between when we forward DIFF packets and when we call startForwarding. I knew the synching was possibly excessive in the first version (looking back I'm not actually remembering why I decided to sync all of the calls together), but we need at least some additional synchronization.
          Hide
          Camille Fournier added a comment -

          One major bug in this patch: You don't send the actual dataTreeLastProcessedZxid when you send a SNAP packet, so the peer's snapshot file will have id 0 if no other commit packets come in after it rejoins (since it takes the zxid off the quorum packet SNAP).
          Changing LearnerHandler to set zxid in SNAP QP to dataTreeLastProcessedZxid and rerunning tests.

          Show
          Camille Fournier added a comment - One major bug in this patch: You don't send the actual dataTreeLastProcessedZxid when you send a SNAP packet, so the peer's snapshot file will have id 0 if no other commit packets come in after it rejoins (since it takes the zxid off the quorum packet SNAP). Changing LearnerHandler to set zxid in SNAP QP to dataTreeLastProcessedZxid and rerunning tests.
          Hide
          Camille Fournier added a comment -

          Also doesn't use my read lock on the proposal list. But I think I have a fix that will properly lock things in DIFF case, leave off the locking in SNAP case, and hopefully please all parties. Will send soon.

          Show
          Camille Fournier added a comment - Also doesn't use my read lock on the proposal list. But I think I have a fix that will properly lock things in DIFF case, leave off the locking in SNAP case, and hopefully please all parties. Will send soon.
          Hide
          Benjamin Reed added a comment -

          yes, i was just addressing the client side of the problem in my patch. we still need to fix the server.

          one thing that throws me: why isn't FollowerResyncConcurrencyTest failing on hudson or my machine?

          Show
          Benjamin Reed added a comment - yes, i was just addressing the client side of the problem in my patch. we still need to fix the server. one thing that throws me: why isn't FollowerResyncConcurrencyTest failing on hudson or my machine?
          Hide
          Camille Fournier added a comment -

          Ah cool. Still waiting for patch approval hopefully will give you a patch later today.

          That test is not the awesomest, due to the nature of the beast it will only fail transiently on errors. Also I just realized my tests are really not named well (right now the snap test in fact always creates diffs, and worthless diffs at that). I'm going to tweak it to try to make it more reliably failing.

          Show
          Camille Fournier added a comment - Ah cool. Still waiting for patch approval hopefully will give you a patch later today. That test is not the awesomest, due to the nature of the beast it will only fail transiently on errors. Also I just realized my tests are really not named well (right now the snap test in fact always creates diffs, and worthless diffs at that). I'm going to tweak it to try to make it more reliably failing.
          Hide
          Patrick Hunt added a comment -

          Perhaps more documentation could be added to the test method(s) to detail; 1) the intent of the test, and 2) any quirks - such as the test not being deterministic? (this goes for all tests, not just this one. Something we should do a better job on in general). Keep up the great work!

          Show
          Patrick Hunt added a comment - Perhaps more documentation could be added to the test method(s) to detail; 1) the intent of the test, and 2) any quirks - such as the test not being deterministic? (this goes for all tests, not just this one. Something we should do a better job on in general). Keep up the great work!
          Hide
          Camille Fournier added a comment -

          Attempt #3. Description on the tests, fixes to server side, small fix to Learner on top of Ben's patch

          Show
          Camille Fournier added a comment - Attempt #3. Description on the tests, fixes to server side, small fix to Learner on top of Ben's patch
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468317/ZOOKEEPER-962_3.patch
          against trunk revision 1055924.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12468317/ZOOKEEPER-962_3.patch against trunk revision 1055924. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/87//console This message is automatically generated.
          Hide
          Vishal Kher added a comment -

          Hi Camille, Ben,

          I have one related question.

          There is a bug in the current implementation (without this patch) in code that sends SNAP to the follower only if the follower crashes while it is attempting to sync (as described by Ben earlier). Is my understanding correct?

          Thanks.

          Show
          Vishal Kher added a comment - Hi Camille, Ben, I have one related question. There is a bug in the current implementation (without this patch) in code that sends SNAP to the follower only if the follower crashes while it is attempting to sync (as described by Ben earlier). Is my understanding correct? Thanks.
          Hide
          Camille Fournier added a comment -

          Yes, only if the follower crashes before writing out the snap file and after writing a log file.

          Show
          Camille Fournier added a comment - Yes, only if the follower crashes before writing out the snap file and after writing a log file.
          Hide
          Camille Fournier added a comment -

          Any thoughts on the latest patch?

          Show
          Camille Fournier added a comment - Any thoughts on the latest patch?
          Hide
          Benjamin Reed added a comment -

          this looks great camille! i have very minor comments but it looks good to go. thanx for all your hard work! btw, the find bugs seems to be messed up. the bugs found don't seem to be related to the patch.

          my comments are in https://reviews.apache.org/r/333/

          i can fix the indentation if you want me to. i'd like you opinion on the first comment.

          Show
          Benjamin Reed added a comment - this looks great camille! i have very minor comments but it looks good to go. thanx for all your hard work! btw, the find bugs seems to be messed up. the bugs found don't seem to be related to the patch. my comments are in https://reviews.apache.org/r/333/ i can fix the indentation if you want me to. i'd like you opinion on the first comment.
          Hide
          Benjamin Reed added a comment -

          i believe this addresses the problems in ZOOKEEPER-919 as well. correct?

          Show
          Benjamin Reed added a comment - i believe this addresses the problems in ZOOKEEPER-919 as well. correct?
          Hide
          Camille Fournier added a comment -

          Yes this should fix 919.

          If you can fix the indentation that would be helpful thanks.

          Show
          Camille Fournier added a comment - Yes this should fix 919. If you can fix the indentation that would be helpful thanks.
          Hide
          Benjamin Reed added a comment -

          fixing comment and indentation

          Show
          Benjamin Reed added a comment - fixing comment and indentation
          Hide
          Camille Fournier added a comment -

          I think patch 4 is missing a small fix I added to Learner.syncWithLeader. We can't call logRequest on the zk in UPTODATE because we haven't actually started up the zookeeper yet, and you'll get a NPE if there are proposals that need to be logged.

          Show
          Camille Fournier added a comment - I think patch 4 is missing a small fix I added to Learner.syncWithLeader. We can't call logRequest on the zk in UPTODATE because we haven't actually started up the zookeeper yet, and you'll get a NPE if there are proposals that need to be logged.
          Hide
          Benjamin Reed added a comment -

          is that fix in patch 3? that is also calling logRequest in UPTODATE. right?

          Show
          Benjamin Reed added a comment - is that fix in patch 3? that is also calling logRequest in UPTODATE. right?
          Hide
          Camille Fournier added a comment -

          Hmmmm.... what happened... somehow I sent a version of this patch with some changes but not others? Let me resend, I have no idea how that happened.

          Show
          Camille Fournier added a comment - Hmmmm.... what happened... somehow I sent a version of this patch with some changes but not others? Let me resend, I have no idea how that happened.
          Hide
          Camille Fournier added a comment -

          Apologies, this contains the indent and comment fix, as well as better comments on the tests and the fix in Learner

          Show
          Camille Fournier added a comment - Apologies, this contains the indent and comment fix, as well as better comments on the tests and the fix in Learner
          Hide
          Camille Fournier added a comment -

          Once more without reformatting all of LearnerHandler

          Show
          Camille Fournier added a comment - Once more without reformatting all of LearnerHandler
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468808/ZOOKEEPER-962_5.patch
          against trunk revision 1055924.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12468808/ZOOKEEPER-962_5.patch against trunk revision 1055924. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/101//console This message is automatically generated.
          Hide
          Benjamin Reed added a comment -

          sorry for dragging this out, but it really is important code.

          i think we need a lock around the code where we check for TRUNC and we start forwarding. right? we could run into problems if a commit happens between the calculation of TRUNC and the call to startForwarding.

          btw, one other favor, could you make your editor use spaces instead of tabs?

          Show
          Benjamin Reed added a comment - sorry for dragging this out, but it really is important code. i think we need a lock around the code where we check for TRUNC and we start forwarding. right? we could run into problems if a commit happens between the calculation of TRUNC and the call to startForwarding. btw, one other favor, could you make your editor use spaces instead of tabs?
          Hide
          Camille Fournier added a comment -

          re: tabs, sure, sorry about that.

          Will look at TRUNC. That's probably why I had all of this stuff in that sync'ed startForwarding in one of the earlier patches. I make no promises about my ability to write a test to repro it tho

          Show
          Camille Fournier added a comment - re: tabs, sure, sorry about that. Will look at TRUNC. That's probably why I had all of this stuff in that sync'ed startForwarding in one of the earlier patches. I make no promises about my ability to write a test to repro it tho
          Hide
          Benjamin Reed added a comment -

          yeah TRUNC almost never happens, so it would be hard to recreate

          Show
          Benjamin Reed added a comment - yeah TRUNC almost never happens, so it would be hard to recreate
          Hide
          Benjamin Reed added a comment -

          moves the TRUNC calculation into the lock block, which ends up simplifying the code. what do you think camille?

          i've also fixed the tabs to spaces.

          Show
          Benjamin Reed added a comment - moves the TRUNC calculation into the lock block, which ends up simplifying the code. what do you think camille? i've also fixed the tabs to spaces.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468946/ZOOKEEPER-962_6.patch
          against trunk revision 1055924.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12468946/ZOOKEEPER-962_6.patch against trunk revision 1055924. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 9 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/102//console This message is automatically generated.
          Hide
          Camille Fournier added a comment -

          This looks good to me, thanks for fixing. Go team!

          Show
          Camille Fournier added a comment - This looks good to me, thanks for fixing. Go team!
          Hide
          Benjamin Reed added a comment -

          +1 running final tests, and will then commit.

          Show
          Benjamin Reed added a comment - +1 running final tests, and will then commit.
          Hide
          Benjamin Reed added a comment -

          works great for trunk the new FollowerResync test fails on 3.3 even though the patch applies almost perfectly.

          Show
          Benjamin Reed added a comment - works great for trunk the new FollowerResync test fails on 3.3 even though the patch applies almost perfectly.
          Hide
          Camille Fournier added a comment -

          Taking a look...

          Show
          Camille Fournier added a comment - Taking a look...
          Hide
          Camille Fournier added a comment -

          We need ZOOKEEPER-882 for this patch to work. Applying that to the 3.3 branch causes the tests to pass.

          Show
          Camille Fournier added a comment - We need ZOOKEEPER-882 for this patch to work. Applying that to the 3.3 branch causes the tests to pass.
          Hide
          Benjamin Reed added a comment -

          great find camille! that fixed it. committing to 3.3 now. thanx for sticking with this!

          Show
          Benjamin Reed added a comment - great find camille! that fixed it. committing to 3.3 now. thanx for sticking with this!
          Hide
          Benjamin Reed added a comment -

          trunk: Committed revision 1062244.
          branch-3.3: Committed revision 1062327.

          Show
          Benjamin Reed added a comment - trunk: Committed revision 1062244. branch-3.3: Committed revision 1062327.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1074 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1074/)
          ZOOKEEPER-962. leader/follower coherence issue when follower is receiving a DIFF

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1074 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1074/ ) ZOOKEEPER-962 . leader/follower coherence issue when follower is receiving a DIFF

            People

            • Assignee:
              ChiaHung Lin
              Reporter:
              Camille Fournier
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development