ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1124

Multiop submitted to non-leader always fails due to timeout

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: server
    • Labels:
      None
    • Environment:

      all

    • Hadoop Flags:
      Reviewed

      Description

      The new Multiop support added under zookeeper-965 fails every single time if the multiop is submitted to a non-leader in quorum mode. In standalone mode it always works properly and this bug only presents itself in quorum mode (with 2 or more nodes). After 12 hours of debugging (sigh) it turns out to be a really simple fix. There are a couple of missing case statements inside FollowerRequestProcessor.java and ObserverRequestProcessor.java to ensure that multiop is forwarded to the leader for commit. I've attached a patch that fixes this problem.

      It's probably worth nothing that zookeeper-965 has already been committed to trunk. But this is a fatal flaw that will prevent multiop support from working properly and as such needs to get committed to 3.4.0 as well. Is there a way to tie these two cases together in some way?

      1. ZOOKEEPER-1124.patch
        4 kB
        Marshall McMullen

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1244 (See https://builds.apache.org/job/ZooKeeper-trunk/1244/)
          ZOOKEEPER-1124. Multiop submitted to non-leader always fails due to timeout

          breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146961
          Files :

          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java
          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ObserverRequestProcessor.java
          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1244 (See https://builds.apache.org/job/ZooKeeper-trunk/1244/ ) ZOOKEEPER-1124 . Multiop submitted to non-leader always fails due to timeout breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146961 Files : /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ObserverRequestProcessor.java
          Hide
          Marshall McMullen added a comment -

          Thanks for fixing the imports Benjamin, totally appreciate it. I was just sitting down to fix that, so I was happy to see an email from you that you'd already taken care of it .

          Show
          Marshall McMullen added a comment - Thanks for fixing the imports Benjamin, totally appreciate it. I was just sitting down to fix that, so I was happy to see an email from you that you'd already taken care of it .
          Hide
          Benjamin Reed added a comment -

          +1 i resolved the imports before checking in the patch. thanks for the patch marshall!

          Committed revision 1146961.

          Show
          Benjamin Reed added a comment - +1 i resolved the imports before checking in the patch. thanks for the patch marshall! Committed revision 1146961.
          Hide
          Benjamin Reed added a comment -

          looks great. this is ready to go. the only thing that needs to be fixed is the import org.apache.zookeeper.*; we never import *, so it needs to be expanded.

          Show
          Benjamin Reed added a comment - looks great. this is ready to go. the only thing that needs to be fixed is the import org.apache.zookeeper.*; we never import *, so it needs to be expanded.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486402/ZOOKEEPER-1124.patch
          against trunk revision 1146025.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12486402/ZOOKEEPER-1124.patch against trunk revision 1146025. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//console This message is automatically generated.
          Hide
          Marshall McMullen added a comment -

          Replacing earlier patch with a better one that contains a unit test that exhibits the fix works properly. The unit test essentially connects to a follower, then submits a multiop to the follower. It then verifies that the multiop succeeded properly.

          When I run this unit test WITHOUT the required fixes in FollowerRequestProcessor.java and ObserverRequestProcessor.java, then I get a nice failure that correctly replicates the failures I've seen in our integration of multi:

          Testcase: testMultiToFollower took 28.451 sec
          Caused an ERROR
          KeeperErrorCode = ConnectionLoss
          org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
          at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
          at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:886)
          at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:876)
          at org.apache.zookeeper.test.QuorumTest.testMultiToFollower(QuorumTest.java:89)
          at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)

          WITH the fixes in the included patch, the unit test passes correctly.

          Show
          Marshall McMullen added a comment - Replacing earlier patch with a better one that contains a unit test that exhibits the fix works properly. The unit test essentially connects to a follower, then submits a multiop to the follower. It then verifies that the multiop succeeded properly. When I run this unit test WITHOUT the required fixes in FollowerRequestProcessor.java and ObserverRequestProcessor.java, then I get a nice failure that correctly replicates the failures I've seen in our integration of multi: Testcase: testMultiToFollower took 28.451 sec Caused an ERROR KeeperErrorCode = ConnectionLoss org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:886) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:876) at org.apache.zookeeper.test.QuorumTest.testMultiToFollower(QuorumTest.java:89) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) WITH the fixes in the included patch, the unit test passes correctly.
          Hide
          Marshall McMullen added a comment -

          Replacing this patch with an updated one with a proper unit test.

          Show
          Marshall McMullen added a comment - Replacing this patch with an updated one with a proper unit test.
          Hide
          Mahadev konar added a comment -

          Marshall, take a look at QuorumTest to see how to start a real quorom set of servers.

          Show
          Mahadev konar added a comment - Marshall, take a look at QuorumTest to see how to start a real quorom set of servers.
          Hide
          Mahadev konar added a comment -

          Marshall, take a look at QuorumTest to see how to start a real quorom set of servers.

          Show
          Mahadev konar added a comment - Marshall, take a look at QuorumTest to see how to start a real quorom set of servers.
          Hide
          Marshall McMullen added a comment -

          Thanks for the suggestion Camille. I'll try to add a unit test this evening that verifies behavior.

          Show
          Marshall McMullen added a comment - Thanks for the suggestion Camille. I'll try to add a unit test this evening that verifies behavior.
          Hide
          Camille Fournier added a comment -

          The QuorumBase and associated tests are probably what you want, Marshall. Those tests can be used to set up a quorum of ZKs, and if you look at tests that extend it tehy will show you how to figure out if you are connected to a leader or follower.

          Show
          Camille Fournier added a comment - The QuorumBase and associated tests are probably what you want, Marshall. Those tests can be used to set up a quorum of ZKs, and if you look at tests that extend it tehy will show you how to figure out if you are connected to a leader or follower.
          Hide
          Marshall McMullen added a comment -

          I agree with Ted's comment regarding the default here is to fail (drop on floor) when new op types are added. I thought of throwing an exception if it wasn't in the switch so it would fail at runtime instead of silently failing as it does now. But perhaps the better answer is to instead of enumerating all op types, just pass them all through to the appropriate Observer/Follower. Then let it deal with throwing an exception if the type isn't supported. That or if there are explicit types not supported by an observer/follower, deal with those explicitly then assume the rest are valid and pass them along.

          Show
          Marshall McMullen added a comment - I agree with Ted's comment regarding the default here is to fail (drop on floor) when new op types are added. I thought of throwing an exception if it wasn't in the switch so it would fail at runtime instead of silently failing as it does now. But perhaps the better answer is to instead of enumerating all op types, just pass them all through to the appropriate Observer/Follower. Then let it deal with throwing an exception if the type isn't supported. That or if there are explicit types not supported by an observer/follower, deal with those explicitly then assume the rest are valid and pass them along.
          Hide
          Marshall McMullen added a comment -

          I wrote unit tests in our own local test harness to verify the fix. I also tested manually to verify the fix.

          I wasn't able to figure out how to write a proper zookeeper unit test in either the C or Java unit tests as I couldn't find a single example where we spawn off multiple zookeeper servers in quorum mode. Is there an example of this somewhere that I've missed? I see lots of examples of starting up one zookeeper server, and various mock instances of servers, but nowhere that I could find where we start multiple real servers.

          Show
          Marshall McMullen added a comment - I wrote unit tests in our own local test harness to verify the fix. I also tested manually to verify the fix. I wasn't able to figure out how to write a proper zookeeper unit test in either the C or Java unit tests as I couldn't find a single example where we spawn off multiple zookeeper servers in quorum mode. Is there an example of this somewhere that I've missed? I see lots of examples of starting up one zookeeper server, and various mock instances of servers, but nowhere that I could find where we start multiple real servers.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486329/multi-non-observer.patch
          against trunk revision 1146025.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12486329/multi-non-observer.patch against trunk revision 1146025. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          All,

          Is there a way to restructure the code so that naive implementors don't run into this situation? Essentially, the code as it stands is default-fail and it would be nice to make it be default-succeed in the presence of new ops.

          Or is the addition of new ops rare enough that this doesn't matter?

          Show
          Ted Dunning added a comment - All, Is there a way to restructure the code so that naive implementors don't run into this situation? Essentially, the code as it stands is default-fail and it would be nice to make it be default-succeed in the presence of new ops. Or is the addition of new ops rare enough that this doesn't matter?
          Hide
          Ted Dunning added a comment -

          Marshall,

          This fix is clearly important. Do you have any tests?

          The role of these tests is not just to verify this bug, but also to provide a prototype for any later implementors of new operations.

          Show
          Ted Dunning added a comment - Marshall, This fix is clearly important. Do you have any tests? The role of these tests is not just to verify this bug, but also to provide a prototype for any later implementors of new operations.
          Hide
          Ted Dunning added a comment -

          This bug is a defect in the committed version of ZOOKEEPER-965

          Show
          Ted Dunning added a comment - This bug is a defect in the committed version of ZOOKEEPER-965
          Hide
          Marshall McMullen added a comment -

          Patch to add OpCode.multi case statements into FollowerRequestProcessor.java and ObserverRequestProcessor.java.

          Show
          Marshall McMullen added a comment - Patch to add OpCode.multi case statements into FollowerRequestProcessor.java and ObserverRequestProcessor.java.

            People

            • Assignee:
              Marshall McMullen
              Reporter:
              Marshall McMullen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development