ZooKeeper / ZOOKEEPER-1870

flakey test in StandaloneDisabledTest.startSingleServerTest

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: tests
    • Labels:
      None

      Description

      I'm seeing the following failure a lot. It seems to be a flaky test (it passes every so often).

      junit.framework.AssertionFailedError: client could not connect to reestablished quorum: giving up after 30+ seconds.
      	at org.apache.zookeeper.test.ReconfigTest.testNormalOperation(ReconfigTest.java:143)
      	at org.apache.zookeeper.server.quorum.StandaloneDisabledTest.startSingleServerTest(StandaloneDisabledTest.java:75)
      	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
      

      I've found 3 problems:

      1. QuorumCnxManager.Listener.run() leaks the socket depending on when the shutdown flag gets set.
      2. QuorumCnxManager.halt() doesn't wait for the listener to terminate.
      3. QuorumPeer.shuttingDownLE flag doesn't get reset when restarting the leader election.

      1. test.log
        170 kB
        Raul Gutierrez Segales
      2. ZOOKEEPER-1870.patch
        7 kB
        Michi Mutsuzaki
      3. ZOOKEEPER-1870.patch
        5 kB
        Michi Mutsuzaki

          Activity

          Michi Mutsuzaki added a comment -

          I can't reproduce it now, but I think there was a case where the quorum peer incorrectly became the leader after shutdown() was called if proposedLeader wasn't set to -1. I'm guessing it could happen if shutdown() gets called right before this block of code executes. Maybe there is a way to shut down the leader election more cleanly?

          /*
           * This predicate is true once we don't read any new
           * relevant message from the reception queue
           */
          if (n == null) {
              self.setPeerState((proposedLeader == self.getId()) ?
                      ServerState.LEADING: learningState());
          
              Vote endVote = new Vote(proposedLeader,
                      proposedZxid, proposedEpoch);
              leaveInstance(endVote);
              return endVote;
          }
          

          Yes, I think we should fix this in 3.4. I'll upload a separate patch for 3.4.
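
To illustrate why resetting proposedLeader matters, here is a minimal sketch of the end-of-election check quoted above. It is not the FastLeaderElection code: the class ElectionSketch, its methods, and the string state names are hypothetical, and it assumes that a peer initially proposes itself and that -1 is never a valid server id.

```java
// Hypothetical model of the "n == null" branch; not the real FastLeaderElection.
class ElectionSketch {
    private final long myId;
    private long proposedLeader;

    ElectionSketch(long myId) {
        this.myId = myId;
        // Assumption: a peer starts an election by proposing itself.
        this.proposedLeader = myId;
    }

    // Models shutdown(); resetProposal toggles the change discussed in this thread.
    void shutdown(boolean resetProposal) {
        if (resetProposal) {
            proposedLeader = -1; // assumed to match no server id
        }
    }

    // Mirrors the quoted predicate: proposedLeader == self.getId() ? LEADING : learningState().
    String stateWhenQueueDrained() {
        return proposedLeader == myId ? "LEADING" : "LEARNING";
    }
}
```

In this model, a peer that is shut down without the reset still evaluates to LEADING if the check runs afterwards, while a peer whose proposal was reset does not.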

          Flavio Junqueira added a comment - - edited

          This looks good to me. I was just wondering if there is a concrete reason for setting proposedLeader to -1 when we shut it down. Is it necessary or just good to have?

          I was also wondering if we should check this into the 3.4 branch as well as trunk.

          Flavio Junqueira added a comment -

          I'll have a look today.

          Michi Mutsuzaki added a comment -

          Patrick Hunt / Flavio Junqueira, could one of you take a look at the patch?

          Michi Mutsuzaki added a comment -

          Thank you for running the test again Raul.

          Raul Gutierrez Segales added a comment -

          The last patch has passed every time so far (66 runs).

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635811/ZOOKEEPER-1870.patch
          against trunk revision 1577756.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1970//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1970//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1970//console

          This message is automatically generated.

          Michi Mutsuzaki added a comment -

          3 additional changes:

          • Reset proposedLeader to -1 in FastLeaderElection.shutdown().
          • Get out of the WorkerReceiver.run() loop after calling self.getElectionAlg().shutdown().
          • Make FastLeaderElection.getVote() public for unit test. Let me know if making this method public is ok with you guys.
          Alexander Shraer added a comment -

          thanks Michi!

          Michi Mutsuzaki added a comment -

          Another change I added was to reset proposedLeader to -1 in FastLeaderElection.shutdown(). I'll run the test 200 times before uploading the patch this time.

          Alexander Shraer added a comment -

          yes, looks like you're right. It sets stop to true but then there's a bunch of code that may still be executed in the remainder of the loop, so break sounds like a good idea.

          Michi Mutsuzaki added a comment -

          Alexander Shraer, it looks like the problem is in FastLeaderElection. WorkerReceiver.run() doesn't get out of the while loop after calling self.getElectionAlg().shutdown(), and node 1 is becoming the leader when it shouldn't. Should we put a break after self.getElectionAlg().shutdown() so that the rest of the logic doesn't get executed when restarting the leader election?
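
The effect of the proposed break can be shown with a small stand-alone loop. This is only a sketch of the control flow, not WorkerReceiver.run(): the class ReceiverLoopSketch, the message names, and the recorded action strings are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a receiver loop that shuts the election down mid-iteration.
class ReceiverLoopSketch {
    static List<String> run(boolean breakAfterShutdown) {
        List<String> actions = new ArrayList<>();
        String[] messages = {"notification", "reconfig", "notification"};
        boolean stop = false;
        int i = 0;
        while (!stop && i < messages.length) {
            String m = messages[i++];
            if (m.equals("reconfig")) {
                actions.add("shutdown"); // stands in for self.getElectionAlg().shutdown()
                stop = true;
                if (breakAfterShutdown) {
                    break; // skip the rest of this iteration
                }
            }
            // Without the break, this still runs once after shutdown.
            actions.add("process " + m);
        }
        return actions;
    }
}
```

Setting the stop flag only prevents the next iteration. The remainder of the current iteration still runs unless the loop breaks right after the shutdown call.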

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635699/test.log
          against trunk revision 1577756.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1967//console

          This message is automatically generated.

          Raul Gutierrez Segales added a comment -

          Oh, and it fails with:

              [junit] 2014-03-19 17:50:42,869 [myid:] - INFO  [main:ZKTestCase$1@66] - FAILED startSingleServerTest
              [junit] java.lang.AssertionError: Server 1 is not up
              [junit] 	at org.junit.Assert.fail(Assert.java:93)
              [junit] 	at org.junit.Assert.assertTrue(Assert.java:43)
            ....
          
          Raul Gutierrez Segales added a comment -

          Hi Michi,

          Platform is Fedora Linux, with 3.13 Kernel on x86_64:

          $ uname -r
          3.13.6-200.fc20.x86_64
          $ arch
          x86_64
          $ cat /etc/fedora-release 
          Fedora release 20 (Heisenbug)
          

          This is mostly out of trunk plus some other patches that I had (but mostly unrelated). I'll run again out of pure trunk.

          Michi Mutsuzaki added a comment -

          Thanks for running the test Raul. It seems like there are more things to fix. The log file should be under build/test/logs/ even without the -Dtest.output=yes option. Which platform are you using?

          --Michi

          Raul Gutierrez Segales added a comment -

          Also, I was only running that test, not the whole suite, i.e.:

          while :; do ant -Dtestcase=StandaloneDisabledTest test-core-java ; done | tee test.log
          
          Raul Gutierrez Segales added a comment -

          I just got a failure with this patch after 44 iterations:

          junit.run:
              [junit] WARNING: multiple versions of ant detected in path for junit 
              [junit]          jar:file:/usr/share/java/ant/ant.jar!/org/apache/tools/ant/Project.class
              [junit]      and jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
              [junit] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
              [junit] Running org.apache.zookeeper.server.quorum.StandaloneDisabledTest
              [junit] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 61.398 sec
          

          I was running without -Dtest.output=yes, alas. Will run again with -Dtest.output=yes.

          Helen Hastings added a comment -

          Ran it here 100 times as well and they all passed. Thank you Michi, +1!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635000/ZOOKEEPER-1870.patch
          against trunk revision 1577756.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1966//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1966//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1966//console

          This message is automatically generated.

          Michi Mutsuzaki added a comment -

          I think it's another flaky test that's not related to this patch. I ran the test on my box like 100 times and they all passed. Let me run the build again to see if it fails again.

          Alexander Shraer added a comment -

          On second thought, it looks like some C tests are failing?

          Alexander Shraer added a comment -

          +1, thanks Michi!

          if possible, please update the description of the jira to reflect the problems you found (as you did on reviewboard)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635000/ZOOKEEPER-1870.patch
          against trunk revision 1577756.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1965//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1965//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1965//console

          This message is automatically generated.

          Michi Mutsuzaki added a comment -

          https://reviews.apache.org/r/19269/

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12635000/ZOOKEEPER-1870.patch
          against trunk revision 1577756.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1964//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1964//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1964//console

          This message is automatically generated.

          Michi Mutsuzaki added a comment -

          There is another problem: QuorumPeer.restartLeaderElection() doesn't clear the shuttingDownLE flag.

          Patrick Hunt added a comment -

          fwiw I'm still seeing this a lot on my setup (once a day at least; it's the most common cause of CI job failure for me). Thanks for making it a priority!

          Michi Mutsuzaki added a comment -

          OK, I think I know what the problem is. There is a race between QuorumCnxManager.Listener.run() and QuorumCnxManager.Listener.halt() that causes the socket to leak.

          1. QuorumCnxManager.Listener.run() goes into the while loop while((!shutdown) && (numRetries < 3))
          2. QuorumCnxManager.halt() gets called, sets shutdown to true and calls QuorumCnxManager.Listener.halt().
          3. QuorumCnxManager.Listener.halt() closes the socket.
          4. QuorumCnxManager.Listener.run() binds the socket and breaks out of the while loop since the shutdown flag is set.

          I'll upload a patch.
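
One way to close this kind of race is sketched below. It is not the attached patch and not the QuorumCnxManager code: SketchListener and its methods are hypothetical, and binding to an ephemeral loopback port is an assumption made so the example runs on its own. The idea is that run() checks the shutdown flag and publishes the socket under one lock, run() always closes whatever it opened, and halt() waits for the listener thread to finish.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a listener thread; illustrates the pattern only.
class SketchListener extends Thread {
    private volatile boolean shutdown = false;
    private ServerSocket ss; // guarded by "this"

    @Override
    public void run() {
        ServerSocket local = null;
        try {
            synchronized (this) {
                if (shutdown) {
                    return; // halt() already ran: never open a socket
                }
                local = new ServerSocket();
                local.setReuseAddress(true);
                local.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port (assumption)
                ss = local; // publish so halt() can close it
            }
            while (!shutdown) {
                local.accept().close(); // real code would hand the connection off
            }
        } catch (IOException e) {
            // Expected when halt() closes the socket while accept() is blocked.
        } finally {
            closeQuietly(local); // whatever run() opened, run() closes
        }
    }

    public void halt() throws InterruptedException {
        synchronized (this) {
            shutdown = true;
            closeQuietly(ss);
        }
        join(); // wait for the listener thread to terminate
    }

    public synchronized boolean isSocketClosed() {
        return ss == null || ss.isClosed();
    }

    private static void closeQuietly(ServerSocket s) {
        if (s != null) {
            try {
                s.close();
            } catch (IOException ignored) {
                // nothing useful to do on close failure in this sketch
            }
        }
    }
}
```

Because the flag check and the bind happen under the same lock that halt() takes, the interleaving in steps 1 to 4 above cannot leave a bound socket behind in this model. The join() in halt() also covers the separate concern that the caller should not proceed until the listener has really stopped.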

          Michi Mutsuzaki added a comment -

          Ok thank you for the update Deepak. I was hoping ZOOKEEPER-1805 would fix this issue. I'm assigning this back to Helen.

          Deepak Jagtap added a comment -

          Hi Michi,

          On my setup, StandaloneDisabledTest fails even without the 1805 patch.
          I checked out revision 1574686, and the build shows that StandaloneDisabledTest consistently fails.
          It also fails with the 1805 patch applied.

          Thanks & Regards,
          Deepak

          Michi Mutsuzaki added a comment -

          Does this test fail even with ZOOKEEPER-1805 applied?

          Helen Hastings added a comment -

          Thanks Michi. However, I also see this error happen every once in a while when the test succeeds. I believe the reason it happens more often when the test fails is because over 50 seconds are spent trying to get servers 1 and 2 to connect (we get all the way to here):

          2014-02-28 02:30:39,673 [myid:2] - INFO [QuorumPeer[myid=2]/127.0.0.1:11227:FastLeaderElection@846] - Notification time out: 51200

          so there is just more time/opportunity for the error to happen, as opposed to the success case when the servers are able to connect within a second or faster.

          This could still be related even though it happens in both the success and failure case. Either way I'll continue looking into it on my own as well.

          Michi Mutsuzaki added a comment -

          Unassigned this ticket from Helen since it doesn't seem like the failure is directly caused by the disable-standalone mode.

          Michi Mutsuzaki added a comment -

          I see a lot of log messages like this whenever this test fails:

          2014-03-10 23:51:28,550 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@195] - Have smaller server identifier, so dropping the connection: (2, 1)
          

          From the recent mailing list discussion, it looks like this is related to ZOOKEEPER-1805 and ZOOKEEPER-1810.

          http://mail-archives.apache.org/mod_mbox/zookeeper-user/201402.mbox/%3CCAEH-zfq4uxUqi9D4KrD8EvPaU3MxDUt7WHQKDPNCPDQoYAbP6g@mail.gmail.com%3E

          Michi Mutsuzaki added a comment -

          Reassigning it to Helen.

          Michi Mutsuzaki added a comment -

          Alexander Shraer, could you take a look?

          Patrick Hunt added a comment -

          looks like this was introduced by ZOOKEEPER-1691

          Patrick Hunt added a comment -

          recent failure: https://builds.apache.org/job/ZooKeeper-trunk-jdk7/767/

            People

            • Assignee:
              Helen Hastings
              Reporter:
              Patrick Hunt
            • Votes:
              0
              Watchers:
              9
