ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1309

Creating a new ZooKeeper client can leak file handles

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.4
    • Fix Version/s: 3.3.5
    • Component/s: java client
    • Labels:
      None

      Description

      If there is an IOException thrown by the constructor of ClientCnxn then file handles are leaked because of the initialization of the Selector which is never closed.

      final Selector selector = Selector.open();

      If there is an abnormal exit from the constructor then the Selector is not closed and file handles are leaked. You can easily see this by setting the hosts string to garbage ("qwerty", "asdf", etc.) and then try to open a new ZooKeeper connection. I've observed the same behavior in production when there were DNS issues where the host names of the ensemble can no longer be resolved and the application servers quickly run out of handles attempting to (re)connect to zookeeper.

      1. zk-1309-1.patch
        2 kB
        Daniel Lord
      2. zk-1309-1.patch
        2 kB
        Daniel Lord
      3. zk-1309-1.patch
        2 kB
        Daniel Lord
      4. zk-1309-3.patch
        2 kB
        Daniel Lord

        Activity

        Hide
        Camille Fournier added a comment -

        Ran tests and they all passed, so I'm gonna check this in to 3.3.5.

        Show
        Camille Fournier added a comment - Ran tests and they all passed, so I'm gonna check this in to 3.3.5.
        Hide
        Patrick Hunt added a comment -

        I've marked for 3.3.5 per Daniel's suggestion. qabot will likely fail, we'll have to make sure we run the tests manually.

        Show
        Patrick Hunt added a comment - I've marked for 3.3.5 per Daniel's suggestion. qabot will likely fail, we'll have to make sure we run the tests manually.
        Hide
        Daniel Lord added a comment -

        It looks to me that this issue (and fix) don't apply to branches-3.4 and trunk. The structure of the zookeeper client seems to have changed a lot in 3.4. For example the ClientCnxn used to leak file handles because it initialized the Selector.open() before it attempted to resolve any InetAddresses for the ensemble. This is no longer the case. If you look at the constructor of ZooKeeper now the StaticHostProvider constructor is responsible for resolving the InetAddresses. If any of the addresses are bad then StaticHostProvider will throw an UnknownHostException and the whole ZooKeeper handle will fall apart and won't leak any resources.

        My original test for this was to connect to a single host named "qwerty" where "qwerty" does not resolve to a valid host and to watch the file handles grow very quickly. I can see that in this test there is no file handle leak using the 3.4 zookeeper client.

        Objections on this jira applying only to 3.3.x?

        Show
        Daniel Lord added a comment - It looks to me that this issue (and fix) don't apply to branches-3.4 and trunk. The structure of the zookeeper client seems to have changed a lot in 3.4. For example the ClientCnxn used to leak file handles because it initialized the Selector.open() before it attempted to resolve any InetAddresses for the ensemble. This is no longer the case. If you look at the constructor of ZooKeeper now the StaticHostProvider constructor is responsible for resolving the InetAddresses. If any of the addresses are bad then StaticHostProvider will throw an UnknownHostException and the whole ZooKeeper handle will fall apart and won't leak any resources. My original test for this was to connect to a single host named "qwerty" where "qwerty" does not resolve to a valid host and to watch the file handles grow very quickly. I can see that in this test there is no file handle leak using the 3.4 zookeeper client. Objections on this jira applying only to 3.3.x?
        Hide
        Patrick Hunt added a comment -

        No. That was my point earlier about how this is just kluged together (if someone wants to help on this lmk). The bot uses the last patch you attached to the jira and applies it to trunk regardless of any other setting.

        Show
        Patrick Hunt added a comment - No. That was my point earlier about how this is just kluged together (if someone wants to help on this lmk). The bot uses the last patch you attached to the jira and applies it to trunk regardless of any other setting.
        Hide
        Daniel Lord added a comment -

        OK and then when I select "Patch available" and select the fix version the qabot will attempt the patch on that version(s)?

        Show
        Daniel Lord added a comment - OK and then when I select "Patch available" and select the fix version the qabot will attempt the patch on that version(s)?
        Hide
        Patrick Hunt added a comment -

        qabot looks for patches of jiras in the "patch available" state (ones you submitted) and automatically verifies the patch when it (the most recent patch) changes (iirc on the change part, takes 5-10 minutes to notice, if not just cancel and re-submit).

        Show
        Patrick Hunt added a comment - qabot looks for patches of jiras in the "patch available" state (ones you submitted) and automatically verifies the patch when it (the most recent patch) changes (iirc on the change part, takes 5-10 minutes to notice, if not just cancel and re-submit).
        Hide
        Daniel Lord added a comment -

        OK thanks for the help Patrick. Just to clarify how do I make the qa bot apply to a branch build? I'll check out the other branches (and trunk) to see where a new patch needs to be created for those versions.

        Show
        Daniel Lord added a comment - OK thanks for the help Patrick. Just to clarify how do I make the qa bot apply to a branch build? I'll check out the other branches (and trunk) to see where a new patch needs to be created for those versions.
        Hide
        Patrick Hunt added a comment -

        Looks like trunk (auto patch testing only tests against trunk) is different enough from 3.3 that your now properly formatted patch is not applying due to conflicts. Typically what we do is create patches for each branch in this case - say we should include this in 3.3/3.4/3.5 you might need to create a separate patch for 3.4/3.5 than 3.3 (in some cases we end up with 3 patches). Also, we typically name the patches ZOOKEEPER-####.patch (ZOOKEEPER-####_br34.patch, etc...), jira handles uploading new versions with the same name properly. Also - qa bot always uses the most recent patch. So you need to upload 3.3/3.4 before 3.5 patch if you want to patch to be auto-tested correctly. (sorry about that but it's pretty bare bones and that's how it currently works).

        Show
        Patrick Hunt added a comment - Looks like trunk (auto patch testing only tests against trunk) is different enough from 3.3 that your now properly formatted patch is not applying due to conflicts. Typically what we do is create patches for each branch in this case - say we should include this in 3.3/3.4/3.5 you might need to create a separate patch for 3.4/3.5 than 3.3 (in some cases we end up with 3 patches). Also, we typically name the patches ZOOKEEPER-####.patch (ZOOKEEPER-####_br34.patch, etc...), jira handles uploading new versions with the same name properly. Also - qa bot always uses the most recent patch. So you need to upload 3.3/3.4 before 3.5 patch if you want to patch to be auto-tested correctly. (sorry about that but it's pretty bare bones and that's how it currently works).
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12511557/zk-1309-3.patch
        against trunk revision 1234974.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/913//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12511557/zk-1309-3.patch against trunk revision 1234974. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/913//console This message is automatically generated.
        Hide
        Patrick Hunt added a comment -

        Hi Daniel the issue is this:

        branches/branch-3.3/src/java/main/org/apache/zookeeper/ClientCnxn.java

        our processes assume everything is rooted in a toplevel directory that contains "src" (e.g. "trunk"), in this case that's branches/branch-3.3 directory

        see: https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute

        Show
        Patrick Hunt added a comment - Hi Daniel the issue is this: branches/branch-3.3/src/java/main/org/apache/zookeeper/ClientCnxn.java our processes assume everything is rooted in a toplevel directory that contains "src" (e.g. "trunk"), in this case that's branches/branch-3.3 directory see: https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute
        Hide
        Daniel Lord added a comment -

        Hi Mahadev, the latest patch is from the root check out directory – the one that houses both trunk and branches. Is that still not correct? The patch was for a branch but I can make the same change to the trunk if needed.

        We've been running the patch for a while on production and it works nicely.

        Show
        Daniel Lord added a comment - Hi Mahadev, the latest patch is from the root check out directory – the one that houses both trunk and branches. Is that still not correct? The patch was for a branch but I can make the same change to the trunk if needed. We've been running the patch for a while on production and it works nicely.
        Hide
        Mahadev konar added a comment -

        Daniel,
        The patch doesnt seem to be generated from the trunk/top of the zk source tree. Can you please do that so that hudson can run the patch?

        thanks

        Also, I am moving this to 3.4.2.

        Show
        Mahadev konar added a comment - Daniel, The patch doesnt seem to be generated from the trunk/top of the zk source tree. Can you please do that so that hudson can run the patch? thanks Also, I am moving this to 3.4.2.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12506345/zk-1309-1.patch
        against trunk revision 1208979.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/806//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12506345/zk-1309-1.patch against trunk revision 1208979. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/806//console This message is automatically generated.
        Hide
        Daniel Lord added a comment -

        Hi Mahadev, I've received corporate approval to contribute this patch. What did you need me to do with the patch in order to make it official?

        Show
        Daniel Lord added a comment - Hi Mahadev, I've received corporate approval to contribute this patch. What did you need me to do with the patch in order to make it official?
        Hide
        Patrick Hunt added a comment -

        cancelling for now (until Daniel gets approval)

        Show
        Patrick Hunt added a comment - cancelling for now (until Daniel gets approval)
        Hide
        Daniel Lord added a comment -

        Sorry Mahadev, I may have jumped the gun on this patch a little bit. I'm working on getting corporate approval to submit patches for both bug fixes and enhancements now.

        Show
        Daniel Lord added a comment - Sorry Mahadev, I may have jumped the gun on this patch a little bit. I'm working on getting corporate approval to submit patches for both bug fixes and enhancements now.
        Hide
        Mahadev konar added a comment -

        Daniel,
        Can you please reattach the patch and grant license to apache for inclusion?

        Show
        Mahadev konar added a comment - Daniel, Can you please reattach the patch and grant license to apache for inclusion?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12505555/zk-1309-1.patch
        against trunk revision 1202557.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/799//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12505555/zk-1309-1.patch against trunk revision 1202557. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/799//console This message is automatically generated.
        Hide
        Daniel Lord added a comment -

        Patch addresses the issue by moving the ownership of the Selector down in to the SendThread. The SendThread is the only meaningful user of the Selector and is responsible for calling close() so it should, in my opinion, be the owner of the Selector.

        The only use of the Selector outside of the SendThread is when queueing a packet wakeup() is called when holding the SendThread mutex. I've created a new private synchronized method on SendThread to avoid leaking that there is a Selector.

        This patch will fix the problem because the SendThread is instantiated AFTER the InetAddress lookup so the Selector will not be opened unless we have successfully instantiated the SendThread.

        Show
        Daniel Lord added a comment - Patch addresses the issue by moving the ownership of the Selector down in to the SendThread. The SendThread is the only meaningful user of the Selector and is responsible for calling close() so it should, in my opinion, be the owner of the Selector. The only use of the Selector outside of the SendThread is when queueing a packet wakeup() is called when holding the SendThread mutex. I've created a new private synchronized method on SendThread to avoid leaking that there is a Selector. This patch will fix the problem because the SendThread is instantiated AFTER the InetAddress lookup so the Selector will not be opened unless we have successfully instantiated the SendThread.

          People

          • Assignee:
            Daniel Lord
            Reporter:
            Daniel Lord
          • Votes:
            3 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development