ZooKeeper / ZOOKEEPER-1271

testEarlyLeaderAbandonment failing on solaris - clients not retrying connection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.3.4, 3.4.0, 3.5.0
    • Fix Version/s: 3.3.4, 3.4.0, 3.5.0
    • Component/s: java client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      See:
      https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch34_solaris/1/testReport/junit/org.apache.zookeeper.server.quorum/QuorumPeerMainTest/testEarlyLeaderAbandonment/

      Notice that the clients attempt to connect before the servers have bound; then, 30 seconds later, after seemingly no further client activity, we see:

      2011-10-28 21:40:56,828 [myid:] - INFO [main-SendThread(localhost:11227):ClientCnxn$SendThread@1057] - Client session timed out, have not heard from server in 30032ms for sessionid 0x0, closing socket connection and attempting reconnect

      I believe this is different from ZOOKEEPER-1270 because in the 1270 case it seems like the clients are attempting to connect but the servers are not accepting (notice the stat commands are being dropped due to no server running).

      1. solarisClientFailure.txt.gz
        31 kB
        Patrick Hunt
      2. ZOOKEEPER-1271.patch
        0.5 kB
        Mahadev konar
      3. ZOOKEEPER-1271.patch
        6 kB
        Mahadev konar
      4. ZOOKEEPER-1271-3.4.patch
        6 kB
        Mahadev konar
      5. ZOOKEEPER-1271-trunk.patch
        6 kB
        Mahadev konar
      6. ZOOKEEPER-1271-3.3.patch
        0.5 kB
        Mahadev konar

          Activity

          Patrick Hunt added a comment - edited

          The error handling added to ZOOKEEPER-1174 is causing this bug.

                  try {
                      sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
                      boolean immediateConnect = sock.connect(addr);            
                      if (immediateConnect) {
                          sendThread.primeConnection();
                      }
                  } catch (IOException e) {
                      LOG.error("Unable to open socket to " + addr);
                      sock.close();
                  }
          

          If an exception is thrown inside the try block, the socket is closed but sockKey is left set. As a result, the client will not attempt to reconnect to the server (typically it would continue to retry every second or so). I think the bug here is that the exception should be rethrown; otherwise the 'cleanup' routine in SendThread.run will not be executed.
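          To make the failure mode concrete, here is a minimal, self-contained sketch. It is not ZooKeeper code: FakeConnection, its key field (standing in for sockKey) and the bounded run loop are hypothetical, and it assumes the real loop only starts a new connection attempt when the key is unset and only clears the key in its exception-driven cleanup.

```java
import java.io.IOException;

/** Hypothetical stand-in for the client's connect/retry loop; not ZooKeeper code. */
class RethrowSketch {

    /** Simulated connection state; 'key' plays the role of sockKey. */
    static final class FakeConnection {
        Object key;            // non-null means "a connection attempt is in flight"
        int attempts;          // how many times connect() was called
        final boolean rethrow; // true models the proposed fix

        FakeConnection(boolean rethrow) {
            this.rethrow = rethrow;
        }

        void connect() throws IOException {
            attempts++;
            try {
                key = new Object();                          // like sock.register(...)
                throw new IOException("connection refused"); // like sock.connect(addr) failing
            } catch (IOException e) {
                // The original handler logs and closes the socket here; key stays set.
                if (rethrow) {
                    throw e;
                }
            }
        }

        void cleanup() {
            key = null;
        }
    }

    /** Runs a bounded number of loop iterations and returns the connect attempts made. */
    static int run(boolean rethrow, int iterations) {
        FakeConnection c = new FakeConnection(rethrow);
        for (int i = 0; i < iterations; i++) {
            try {
                if (c.key == null) {
                    c.connect();
                }
                // Otherwise the loop believes a connection is pending and just waits.
            } catch (IOException e) {
                c.cleanup(); // only reached when the exception propagates
            }
        }
        return c.attempts;
    }

    public static void main(String[] args) {
        System.out.println("swallowed: " + run(false, 5) + " attempt(s)");
        System.out.println("rethrown:  " + run(true, 5) + " attempt(s)");
    }
}
```

          With the exception swallowed, the sketch makes one attempt and then stalls because the key is never cleared; with the rethrow, every iteration cleans up and tries again.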

          Ted Dunning added a comment -

          I traced through this code and it looks like re-throwing the exception is a good idea. I certainly don't see any problems.

          On the other hand, this is definitely not an area of code that I know well.

          Matthias Spycher added a comment -

          +1 for rethrow.

          Given that startConnect() used to throw and the run() method would retry in case of a java.net.SocketException, let's rethrow.

          Mahadev konar added a comment -

          Minor patch to rethrow.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501827/ZOOKEEPER-1271.patch
          against trunk revision 1196025.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/756//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/756//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/756//console

          This message is automatically generated.

          Patrick Hunt added a comment -

          This one really does need a test.

          Mahadev konar added a comment -

          Yeah think so too. Trying to come up with a test case. I am tired of integration tests. Trying to see if I can do a real unit test here.

          Mahadev konar added a comment -

          Added a test case. Added mockito as a dependency. The test should fail without the patch and should pass with the patch.
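          The shape of such a unit test can be sketched without a server. This is only an illustration, not the test in the patch: it uses a hand-written fake instead of Mockito so that it runs on the JDK alone, and Connector, registerAndConnect and FailingConnector are hypothetical names rather than ZooKeeper classes.

```java
import java.io.IOException;

/** Illustrative test shape only; all names here are hypothetical. */
class ConnectRethrowTestSketch {

    /** Hypothetical seam: the socket operations the test needs to control. */
    interface Connector {
        boolean connect() throws IOException;
        void close();
    }

    /** Hypothetical unit under test, shaped like the patched connect logic. */
    static void registerAndConnect(Connector sock) throws IOException {
        try {
            sock.connect();
        } catch (IOException e) {
            sock.close();
            throw e; // the behaviour the test pins down
        }
    }

    /** Hand-written fake that always fails to connect and records close(). */
    static final class FailingConnector implements Connector {
        boolean closed;

        @Override
        public boolean connect() throws IOException {
            throw new IOException("simulated connect failure");
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Returns true when the failure propagated and the socket was closed. */
    static boolean failurePropagates() {
        FailingConnector sock = new FailingConnector();
        try {
            registerAndConnect(sock);
            return false; // exception was swallowed: the bug is back
        } catch (IOException expected) {
            return sock.closed;
        }
    }

    public static void main(String[] args) {
        System.out.println("failure propagates: " + failurePropagates());
    }
}
```

          If the throw in registerAndConnect is removed, failurePropagates() returns false, which mirrors the "fails without the patch, passes with the patch" property described above.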

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501905/ZOOKEEPER-1271.patch
          against trunk revision 1196025.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/758//console

          This message is automatically generated.

          Mahadev konar added a comment -

          Need separate patches for trunk and 3.4.

          Mahadev konar added a comment -

          3.4 patch.

          Mahadev konar added a comment -

          Trunk patch.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501908/ZOOKEEPER-1271-trunk.patch
          against trunk revision 1196025.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/759//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/759//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/759//console

          This message is automatically generated.

          Mahadev konar added a comment -

          Patch for the 3.3 branch. The code base in 3.3 makes it almost impossible to write a unit test for this. Can we just commit the patch for the 3.3 branch? There is less chance of the fix getting removed in 3.3. We can just run the patch on a Solaris machine and see if it works. Sounds good?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12502033/ZOOKEEPER-1271-3.3.patch
          against trunk revision 1196025.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/769//console

          This message is automatically generated.

          Patrick Hunt added a comment -

          I don't see any way to add a test in 3.3 either w/o significant structural changes (such as the refactorings that went into 3.4, allowing the mock testing used there).

          Given we can reproduce this on apache jenkins solaris systems, it seems that we already have test coverage for this. no?

          Patrick Hunt added a comment -

          I reviewed and tested this patch on 33/34/trunk, green in all three cases. I also ran this on my CI hardware and I no longer see the issue there either.

          The proof will be on Solaris though - this is reproducible on Solaris with the original test set.

          Thanks Mahadev!

          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1353 (See https://builds.apache.org/job/ZooKeeper-trunk/1353/)
          ZOOKEEPER-1271. testEarlyLeaderAbandonment failing on solaris - clients not retrying connection (mahadev via phunt)

          phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196819
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/ivy.xml
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/ClientReconnectTest.java
          Matthias Spycher added a comment -

          I've verified the rethrow also works for 3.3 on Windows7 where it was previously failing.

          Patrick Hunt added a comment -

          Matthias - that's great. ps. Are you seeing the same failures as we have on Apache Jenkins?
          https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-WinVS2008_java/10/#showFailuresLink
          (granted this is trunk not 3.3)

          Matthias Spycher added a comment -

          I just happened to be running some app-level unit tests on windows and saw a discrepancy with linux due to IPv6 support.


            People

            • Assignee:
              Mahadev konar
              Reporter:
              Patrick Hunt
            • Votes:
              0
              Watchers:
              1
