ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1144

ZooKeeperServer not starting on leader due to a race condition

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: None
    • Labels:
      None

      Description

      I have found one problem that is causing QuorumPeerMainTest:testQuorum to fail. This test uses 2 ZK servers.

      The test is failing because leader is not starting ZooKeeperServer after leader election. so everything halts.

      With the new changes, the server is now started in Leader.processAck() which is called from LeaderHandler. processAck() starts ZooKeeperServer if majority have acked NEWLEADER. The leader puts its ack in the the ackSet in Leader.lead(). Since processAck() is called from LearnerHandler it can happen that the learner's ack is processed before the leader is able to put its ack in the ackSet. When LearnerHandler invokes processAck(), the ackSet for newLeaderProposal will not have quorum (in this case 2). As a result, the ZooKeeperServer is never started on the Leader.

      The leader needs to ensure that its ack is put in ackSet before starting LearnerCnxAcceptor or invoke processAck() itself after adding to ackSet. I haven't had time to go through the ZAB2 changes so I am not too familiar with the code. Can Ben/Flavio fix this?

        Issue Links

          Activity

          Hide
          Vishal Kher added a comment -

          Actually I think I see. If we set up the epoch first, the LearnerHandler thread can complete to processAck before the leader thread completes to add itself to the ack set, causing the bug. Good fix. +1 from me, will check this in.

          Yes, in addition, per my understanding, waitForEpochAck() which is called by LearnerHandler before calling processAck() ensures correctness per the ZAB 1.0 design.

          Show
          Vishal Kher added a comment - Actually I think I see. If we set up the epoch first, the LearnerHandler thread can complete to processAck before the leader thread completes to add itself to the ack set, causing the bug. Good fix. +1 from me, will check this in. Yes, in addition, per my understanding, waitForEpochAck() which is called by LearnerHandler before calling processAck() ensures correctness per the ZAB 1.0 design.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1261 (See https://builds.apache.org/job/ZooKeeper-trunk/1261/)
          ZOOKEEPER-1144: ZooKeeperServer not starting on leader due to a race condition (Vishal K via camille)

          camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156649
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1261 (See https://builds.apache.org/job/ZooKeeper-trunk/1261/ ) ZOOKEEPER-1144 : ZooKeeperServer not starting on leader due to a race condition (Vishal K via camille) camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156649 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          Hide
          Camille Fournier added a comment -

          Committed to trunk, rev 115664

          Show
          Camille Fournier added a comment - Committed to trunk, rev 115664
          Hide
          Camille Fournier added a comment -

          Actually I think I see. If we set up the epoch first, the LearnerHandler thread can complete to processAck before the leader thread completes to add itself to the ack set, causing the bug. Good fix. +1 from me, will check this in.

          Show
          Camille Fournier added a comment - Actually I think I see. If we set up the epoch first, the LearnerHandler thread can complete to processAck before the leader thread completes to add itself to the ack set, causing the bug. Good fix. +1 from me, will check this in.
          Hide
          Camille Fournier added a comment -

          Vishal, can you just explain quickly why moving the ackSet.add to its new location (above waitForEpochAck but below the LearnerCnxAcceptor start) will fix the bug as you described it in the bug report? I'm not super familiar with this part of the code, but if you can just point it out to me I'm happy to +1 this and check it in asap so our builds will stop failing.

          Show
          Camille Fournier added a comment - Vishal, can you just explain quickly why moving the ackSet.add to its new location (above waitForEpochAck but below the LearnerCnxAcceptor start) will fix the bug as you described it in the bug report? I'm not super familiar with this part of the code, but if you can just point it out to me I'm happy to +1 this and check it in asap so our builds will stop failing.
          Hide
          Vishal Kher added a comment -

          Existing tests were enough to catch the problem and also to verify the fix.

          Show
          Vishal Kher added a comment - Existing tests were enough to catch the problem and also to verify the fix.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12489635/ZOOKEEPER-1144.patch
          against trunk revision 1152141.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12489635/ZOOKEEPER-1144.patch against trunk revision 1152141. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/437//console This message is automatically generated.
          Hide
          Vishal Kher added a comment -

          Submitting patch for trunk.

          Show
          Vishal Kher added a comment - Submitting patch for trunk.
          Hide
          Vishal Kher added a comment -

          Attaching patch. All tests passed for me after this fix.

          Show
          Vishal Kher added a comment - Attaching patch. All tests passed for me after this fix.
          Hide
          Vishal Kher added a comment -

          Fix looks to be very simple. Running tests now, will submit patch if tests succeed.

          Show
          Vishal Kher added a comment - Fix looks to be very simple. Running tests now, will submit patch if tests succeed.
          Hide
          Vishal Kher added a comment -

          adding links to test failures likely caused by this bugs

          Show
          Vishal Kher added a comment - adding links to test failures likely caused by this bugs
          Hide
          Eugene Koontz added a comment -

          Thanks Vishal, hopefully this is the same problem with the same solution.

          Show
          Eugene Koontz added a comment - Thanks Vishal, hopefully this is the same problem with the same solution.
          Hide
          Eugene Koontz added a comment -

          Vishal Kher writes to dev list:
          "Error looks similar to what I have been seeing. I reported the cause of the problem here: https://issues.apache.org/jira/browse/ZOOKEEPER-1144
          ".

          Show
          Eugene Koontz added a comment - Vishal Kher writes to dev list: "Error looks similar to what I have been seeing. I reported the cause of the problem here: https://issues.apache.org/jira/browse/ZOOKEEPER-1144 ".

            People

            • Assignee:
              Vishal Kher
              Reporter:
              Vishal Kher
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development