Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.3.4, 3.4.0, 3.5.0
    • Component/s: java client
    • Labels:
      None

      Description

      In the socket connection logic there are several errors that result in bad behavior. The basic problem is that a socket is registered with a selector unconditionally when there are nuances that should be dealt with. First, the socket may connect immediately. Secondly, the connect may throw an exception. In either of these two cases, I don't think that the socket should be registered.

      I will attach a test case that demonstrates the problem. I have been unable to create a unit test that exhibits the problem because I would have to mock the low level socket libraries to do so. It would still be good to do so if somebody can figure out a good way.

      1. zk-fd-leak.tgz
        7 kB
        Ted Dunning
      2. ZOOKEEPER-1174.patch
        1 kB
        Ted Dunning
      3. ZOOKEEPER-1174.patch
        3 kB
        Ted Dunning
      4. ZOOKEEPER-1174.patch
        3 kB
        Ted Dunning
      5. ZOOKEEPER-1174.patch
        6 kB
        Ted Dunning
      6. ZOOKEEPER-1174.patch
        3 kB
        Ted Dunning
      7. ZOOKEEPER-1174fix.patch
        3 kB
        Camille Fournier
      8. ZOOKEEPER-1174-3.3.patch
        3 kB
        Camille Fournier

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          np, this reminds me though, we are only testing the c code on windows (CI env), not java, it would be good to add this - Matthias you interested to help? https://builds.apache.org//view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-WinVS2008/

          If so please start the discussion on the ML. thanks!

          Show
          Patrick Hunt added a comment - np, this reminds me though, we are only testing the c code on windows (CI env), not java, it would be good to add this - Matthias you interested to help? https://builds.apache.org//view/S-Z/view/ZooKeeper/job/ZooKeeper-trunk-WinVS2008/ If so please start the discussion on the ML. thanks!
          Hide
          Matthias Spycher added a comment -

          Patrick, thanks for the ref.

          Show
          Matthias Spycher added a comment - Patrick, thanks for the ref.
          Hide
          Patrick Hunt added a comment -

          Matthias - see ZOOKEEPER-1271, it's the same issues re handling the exception, but on solaris

          Show
          Patrick Hunt added a comment - Matthias - see ZOOKEEPER-1271 , it's the same issues re handling the exception, but on solaris
          Hide
          Matthias Spycher added a comment -

          I've run into a problem with this patch (version 3.3.3) on a system (Windows7) where InetAddress.getAllByName(host) returns candidate IPv4 and IPv6 addresses.

          The reason is that the IOException caught in SendThread.startConnect() is no longer propagated to the calling run() method. In my logs before the patch I would see:

          • Opening socket connection to server localhost/0:0:0:0:0:0:0:1:23233
          • Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
            java.net.SocketException: Address family not supported by protocol family: connect
            at sun.nio.ch.Net.connect(Native Method)
            at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:500)
            at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1050)
            at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1077)

          and now I see:

          • Opening socket connection to server localhost/0:0:0:0:0:0:0:1:23233
          • Unable to open socket to localhost/0:0:0:0:0:0:0:1:23233
          • Client session timed out, have not heard from server in 30002ms for sessionid 0x0, closing socket connection and attempting reconnect

          In the former, the exception was caught in the run() method and the startConnect() retried with the IPv4 address, which works fine. In the latter, the client times out waiting for the server instead of retrying.

          I would recommend rethrowing the IOException in startConnect() until there's a better way to control the InetAddresses in the client.

          Show
          Matthias Spycher added a comment - I've run into a problem with this patch (version 3.3.3) on a system (Windows7) where InetAddress.getAllByName(host) returns candidate IPv4 and IPv6 addresses. The reason is that the IOException caught in SendThread.startConnect() is no longer propagated to the calling run() method. In my logs before the patch I would see: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:23233 Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.SocketException: Address family not supported by protocol family: connect at sun.nio.ch.Net.connect(Native Method) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:500) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1050) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1077) and now I see: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:23233 Unable to open socket to localhost/0:0:0:0:0:0:0:1:23233 Client session timed out, have not heard from server in 30002ms for sessionid 0x0, closing socket connection and attempting reconnect In the former, the exception was caught in the run() method and the startConnect() retried with the IPv4 address, which works fine. In the latter, the client times out waiting for the server instead of retrying. I would recommend rethrowing the IOException in startConnect() until there's a better way to control the InetAddresses in the client.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12497223/ZOOKEEPER-1174-3.3.patch
          against trunk revision 1177432.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/600//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12497223/ZOOKEEPER-1174-3.3.patch against trunk revision 1177432. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/600//console This message is automatically generated.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1318 (See https://builds.apache.org/job/ZooKeeper-trunk/1318/)
          ZOOKEEPER-1174. FD leak when network unreachable (Ted Dunning via camille)

          camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177042
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1318 (See https://builds.apache.org/job/ZooKeeper-trunk/1318/ ) ZOOKEEPER-1174 . FD leak when network unreachable (Ted Dunning via camille) camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177042 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxnSocketNIO.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
          Hide
          Mahadev konar added a comment -

          Ted/Camille, Any update on this? Looks like this was committed to 3.4 branch, and probably trunk?

          Show
          Mahadev konar added a comment - Ted/Camille, Any update on this? Looks like this was committed to 3.4 branch, and probably trunk?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12496886/ZOOKEEPER-1174fix.patch
          against trunk revision 1176903.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12496886/ZOOKEEPER-1174fix.patch against trunk revision 1176903. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/594//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          This is a pretty serious bug if you ever wind up in the corner where it is exercised. That proviso limits the average seriousness of the bug, but file descriptor leaks are never good.

          Show
          Ted Dunning added a comment - This is a pretty serious bug if you ever wind up in the corner where it is exercised. That proviso limits the average seriousness of the bug, but file descriptor leaks are never good.
          Hide
          Camille Fournier added a comment -

          Also this should apply to 3.4 which hopefully doesn't have the change from 786 in it, so I will put it in there. Did we also want to put it in 3.3.4? Are those two releases going out at the same time?

          Show
          Camille Fournier added a comment - Also this should apply to 3.4 which hopefully doesn't have the change from 786 in it, so I will put it in there. Did we also want to put it in 3.3.4? Are those two releases going out at the same time?
          Hide
          Camille Fournier added a comment -

          Good catch Ted, I must ask the same question myself.

          Show
          Camille Fournier added a comment - Good catch Ted, I must ask the same question myself.
          Hide
          Ted Dunning added a comment -

          The old patch fails to apply because of a change introduced in the fix for ZOOKEEPER-786. The chance is this:

          @@ -185,9 +190,7 @@ public class ClientCnxnSocketNIO extends ClientCnxnSocket {
          sock.socket().setSoLinger(false, -1);
          sock.socket().setTcpNoDelay(true);
          sockKey = sock.register(selector, SelectionKey.OP_CONNECT);

          • if (sock.connect(addr)) { - sendThread.primeConnection(); - }

            + sock.connect(addr);
            initialized = false;

          Why was the primeConnection call deleted? I can update the patch to account for this but that seems a bit dangerous since the commit comment on this patch doesn't refer to this change at all.

          Show
          Ted Dunning added a comment - The old patch fails to apply because of a change introduced in the fix for ZOOKEEPER-786 . The chance is this: @@ -185,9 +190,7 @@ public class ClientCnxnSocketNIO extends ClientCnxnSocket { sock.socket().setSoLinger(false, -1); sock.socket().setTcpNoDelay(true); sockKey = sock.register(selector, SelectionKey.OP_CONNECT); if (sock.connect(addr)) { - sendThread.primeConnection(); - } + sock.connect(addr); initialized = false; Why was the primeConnection call deleted? I can update the patch to account for this but that seems a bit dangerous since the commit comment on this patch doesn't refer to this change at all.
          Hide
          Ted Dunning added a comment -

          Hmmm... let me try to update the patch.

          I don't know when Wednesday night your time actually is (I am traveling to distant lands and am very confused just now).

          Show
          Ted Dunning added a comment - Hmmm... let me try to update the patch. I don't know when Wednesday night your time actually is (I am traveling to distant lands and am very confused just now).
          Hide
          Ted Dunning added a comment -

          No. No update beyond that last patch. Should be ready to roll.

          I talked to Camille last week and it sounded like she was on the verge of committing it.

          Show
          Ted Dunning added a comment - No. No update beyond that last patch. Should be ready to roll. I talked to Camille last week and it sounded like she was on the verge of committing it.
          Hide
          Mahadev konar added a comment - - edited

          Wed night my time?

          Show
          Mahadev konar added a comment - - edited Wed night my time?
          Hide
          Camille Fournier added a comment -

          I was going to put this in, but I don't have a patch that cleanly applies, so it will take some work. I'll look at it tomorrow or Wed, when are you planning on doing the release?

          Show
          Camille Fournier added a comment - I was going to put this in, but I don't have a patch that cleanly applies, so it will take some work. I'll look at it tomorrow or Wed, when are you planning on doing the release?
          Hide
          Mahadev konar added a comment -

          Ted,
          Any update on this? Please let me know. I plan to cut a release soon and would like to get this in.

          thanks

          Show
          Mahadev konar added a comment - Ted, Any update on this? Please let me know. I plan to cut a release soon and would like to get this in. thanks
          Hide
          Patrick Hunt added a comment -

          This jira has this set currently:

          Fix Version/s: 3.3.4, 3.4.0, 3.5.0

          So a patch, or patches, that would be applied to branch-3.3, branch-3.4, and trunk.

          Show
          Patrick Hunt added a comment - This jira has this set currently: Fix Version/s: 3.3.4, 3.4.0, 3.5.0 So a patch, or patches, that would be applied to branch-3.3, branch-3.4, and trunk.
          Hide
          Ted Dunning added a comment -

          OK.

          Can you say specifically which branches you mean?

          Show
          Ted Dunning added a comment - OK. Can you say specifically which branches you mean?
          Hide
          Patrick Hunt added a comment -

          Looks like we may need to update the patch, perhaps separate patches for each branch not that branch-3.4 has been created?

          Show
          Patrick Hunt added a comment - Looks like we may need to update the patch, perhaps separate patches for each branch not that branch-3.4 has been created?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12494741/ZOOKEEPER-1174.patch
          against trunk revision 1170886.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/554//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12494741/ZOOKEEPER-1174.patch against trunk revision 1170886. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/554//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          This patch should be ready to commit. Tests are removed pending another JIRA.

          Show
          Ted Dunning added a comment - This patch should be ready to commit. Tests are removed pending another JIRA.
          Hide
          Patrick Hunt added a comment -

          If we go wth powermock let's use the mockito variety.

          Show
          Patrick Hunt added a comment - If we go wth powermock let's use the mockito variety.
          Hide
          Camille Fournier added a comment -

          Yeah, I think that sounds like a plan.

          Show
          Camille Fournier added a comment - Yeah, I think that sounds like a plan.
          Hide
          Ted Dunning added a comment -

          Mocking the test sounds great. Using this bug to bring in a mocking technology that can mock
          static methods is a little more ambitious than I wanted it to be.

          I see that jmockit and powermock both claim the ability to do this. Powermock requires another
          mocking technology underneath. Jmockit has the problem that it isn't available in an official
          maven repo.

          My tendency is to suggest that we commit this without the unit test and open another JIRA to address
          the testing problem in general.

          If I can get sign-off on that, then I will produce a final patch to verify. The code right now stands
          like this:

                  try {
                      boolean immediateConnect = sock.connect(addr);
                      sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
                      if (immediateConnect) {
                          sendThread.primeConnection();
                      }
                  } catch (IOException e) {
                      sock.close();
                  }
                  initialized = false;
          
          Show
          Ted Dunning added a comment - Mocking the test sounds great. Using this bug to bring in a mocking technology that can mock static methods is a little more ambitious than I wanted it to be. I see that jmockit and powermock both claim the ability to do this. Powermock requires another mocking technology underneath. Jmockit has the problem that it isn't available in an official maven repo. My tendency is to suggest that we commit this without the unit test and open another JIRA to address the testing problem in general. If I can get sign-off on that, then I will produce a final patch to verify. The code right now stands like this: try { boolean immediateConnect = sock.connect(addr); sockKey = sock.register(selector, SelectionKey.OP_CONNECT); if (immediateConnect) { sendThread.primeConnection(); } } catch (IOException e) { sock.close(); } initialized = false ;
          Hide
          Camille Fournier added a comment -

          Maybe a little cheesy. Could we do this with mocks? I'm not crazy about having a random "injectSocketError" flag in the code just for testing.

          Also, should probably go ahead and log the socket connection error in the same way we do in SendThread.run, so people don't lose logging information.

          Also, I think you need to register the SockKey before calling primeConnection, otherwise the call in primeConnection to clientCnxnSocket.enableReadWriteOnly() will fail.

          Show
          Camille Fournier added a comment - Maybe a little cheesy. Could we do this with mocks? I'm not crazy about having a random "injectSocketError" flag in the code just for testing. Also, should probably go ahead and log the socket connection error in the same way we do in SendThread.run, so people don't lose logging information. Also, I think you need to register the SockKey before calling primeConnection, otherwise the call in primeConnection to clientCnxnSocket.enableReadWriteOnly() will fail.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12493712/ZOOKEEPER-1174.patch
          against trunk revision 1165443.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/518//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12493712/ZOOKEEPER-1174.patch against trunk revision 1165443. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/518//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          Here is a cheesy test. The idea is that I injected an explicit throw of the same exception that a downed internet connection causes.

          Is this just toooo cheesy to stomach?

          Show
          Ted Dunning added a comment - Here is a cheesy test. The idea is that I injected an explicit throw of the same exception that a downed internet connection causes. Is this just toooo cheesy to stomach?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12493693/ZOOKEEPER-1174.patch
          against trunk revision 1165443.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12493693/ZOOKEEPER-1174.patch against trunk revision 1165443. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/517//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          Try again with different format to please the checking script.

          Show
          Ted Dunning added a comment - Try again with different format to please the checking script.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12493692/ZOOKEEPER-1174.patch
          against trunk revision 1165443.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/516//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12493692/ZOOKEEPER-1174.patch against trunk revision 1165443. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/516//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          Here is an updated patch that maintains the sockKey even for immediate loads. My guess is that this didn't matter in testing so far because it is rare for an async socket to connect instantly.

          This addresses Camille's eagle-eyed comments.

          I have added a few javadoc fixes and one weakening of a catch from Exception to Throwable in the general spirit of making things better when I see them. They are unrelated to this JIRA, but are very minor so do not warrant their own bug report.

          Show
          Ted Dunning added a comment - Here is an updated patch that maintains the sockKey even for immediate loads. My guess is that this didn't matter in testing so far because it is rare for an async socket to connect instantly. This addresses Camille's eagle-eyed comments. I have added a few javadoc fixes and one weakening of a catch from Exception to Throwable in the general spirit of making things better when I see them. They are unrelated to this JIRA, but are very minor so do not warrant their own bug report.
          Hide
          Camille Fournier added a comment -

          Well, it seems like all of the io-related calls use the sockKey; enableWrite, enableRead, cleanup, doIO. I feel like I'm missing some major fundamental point.

          Show
          Camille Fournier added a comment - Well, it seems like all of the io-related calls use the sockKey; enableWrite, enableRead, cleanup, doIO. I feel like I'm missing some major fundamental point.
          Hide
          Ted Dunning added a comment -

          Correct. Is sockKey needed if we don't register with the selector?

          Show
          Ted Dunning added a comment - Correct. Is sockKey needed if we don't register with the selector?
          Hide
          Camille Fournier added a comment -

          -The documentation for sock.connect is exactly what I base my (current) position on. The idea is that if connect returns true, then you don't need to use select to wait for the connection and can proceed immediately with the primeConnection and light up the connection for prime time. It is only if connect returns false that deferred actions are required.-

          But where are you setting sockKey then? You're not setting it at all if it returns true immediately on the first time this is called.

          Show
          Camille Fournier added a comment - - The documentation for sock.connect is exactly what I base my (current) position on. The idea is that if connect returns true, then you don't need to use select to wait for the connection and can proceed immediately with the primeConnection and light up the connection for prime time. It is only if connect returns false that deferred actions are required. - But where are you setting sockKey then? You're not setting it at all if it returns true immediately on the first time this is called.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12493648/ZOOKEEPER-1174.patch
          against trunk revision 1165443.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12493648/ZOOKEEPER-1174.patch against trunk revision 1165443. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/515//console This message is automatically generated.
          Hide
          Ted Dunning added a comment -

          Camille,

          Thanks for looking at this. I am not sure if it my assertion is true either, but it does seem correct to me. (happily, I expressed some doubt)

          The documentation for sock.connect is exactly what I base my (current) position on. The idea is that if connect returns true, then you don't need to use select to wait for the connection and can proceed immediately with the primeConnection and light up the connection for prime time. It is only if connect returns false that deferred actions are required.

          Show
          Ted Dunning added a comment - Camille, Thanks for looking at this. I am not sure if it my assertion is true either, but it does seem correct to me. (happily, I expressed some doubt) The documentation for sock.connect is exactly what I base my (current) position on. The idea is that if connect returns true, then you don't need to use select to wait for the connection and can proceed immediately with the primeConnection and light up the connection for prime time. It is only if connect returns false that deferred actions are required.
          Hide
          Camille Fournier added a comment -

          Sorry, trigger happy about reading your comments. I'm not sure this is true:

          Secondly, is it safe to not register sockets that connect immediately? I think, but am not sure, that the answer is yes because we have clearly already called primeConnection().

          The documentation for sock.connect seems to indicate that you could return true even in a non-blocking mode:
          "If this channel is in non-blocking mode then an invocation of this method initiates a non-blocking connection operation. If the connection is established immediately, as can happen with a local connection, then this method returns true. "

          Show
          Camille Fournier added a comment - Sorry, trigger happy about reading your comments. I'm not sure this is true: Secondly, is it safe to not register sockets that connect immediately? I think, but am not sure, that the answer is yes because we have clearly already called primeConnection(). The documentation for sock.connect seems to indicate that you could return true even in a non-blocking mode: "If this channel is in non-blocking mode then an invocation of this method initiates a non-blocking connection operation. If the connection is established immediately, as can happen with a local connection, then this method returns true. "
          Hide
          Camille Fournier added a comment -

          Ted, don't we still need to register the sockKey even if sock.connect returns true?

          Show
          Camille Fournier added a comment - Ted, don't we still need to register the sockKey even if sock.connect returns true?
          Hide
          Ted Dunning added a comment -

          Here is a proposed patch. There are a few considerations here that merit review.

          First, is it safe to register sockets with a selector after the connect call? I assert yes because select is level based rather than transition based.

          Secondly, is it safe to not register sockets that connect immediately? I think, but am not sure, that the answer is yes because we have clearly already called primeConnection().

          Thirdly, is it OK to not rethrow the io exception from the connect call? I am not sure here. The immediate effect is that connection is only attempted at the timeout rate rather than the faster rate specified by some of the delays in the code. This seems OK at first glance, but other opinions would be nice to have.

          Show
          Ted Dunning added a comment - Here is a proposed patch. There are a few considerations here that merit review. First, is it safe to register sockets with a selector after the connect call? I assert yes because select is level based rather than transition based. Secondly, is it safe to not register sockets that connect immediately? I think, but am not sure, that the answer is yes because we have clearly already called primeConnection(). Thirdly, is it OK to not rethrow the io exception from the connect call? I am not sure here. The immediate effect is that connection is only attempted at the timeout rate rather than the faster rate specified by some of the delays in the code. This seems OK at first glance, but other opinions would be nice to have.
          Hide
          Ted Dunning added a comment -

          Here is a program that demonstrates the problem. It includes a README and sample output with and without the fix.

          Show
          Ted Dunning added a comment - Here is a program that demonstrates the problem. It includes a README and sample output with and without the fix.
          Hide
          Camille Fournier added a comment -

          We'd have to change drastically the try/catch logic to try to connect the socket then register it with the selector.
          Should we just fix this by calling selector.selectNow() in the cleanup method after cancelling the sockKey? I think that might fix the leak.

          Show
          Camille Fournier added a comment - We'd have to change drastically the try/catch logic to try to connect the socket then register it with the selector. Should we just fix this by calling selector.selectNow() in the cleanup method after cancelling the sockKey? I think that might fix the leak.

            People

            • Assignee:
              Ted Dunning
              Reporter:
              Ted Dunning
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development