Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-1620

NIOServerCnxnFactory (new code introduced in ZK-1504) opens selectors but never closes them

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      New code (committed in ZK-1504) opens selectors but doesn't close them.
      Specifically AbstractSelectThread in its constructor does

      this.selector = Selector.open();

      But possibly also elsewhere. Tests fail for me with the following message:

      java.io.IOException: Too many open files
      at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
      at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:69)
      at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
      at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
      at java.nio.channels.Selector.open(Selector.java:209)
      at org.apache.zookeeper.server.NIOServerCnxnFactory$AbstractSelectThread.<init>(NIOServerCnxnFactory.java:128)
      at org.apache.zookeeper.server.NIOServerCnxnFactory$AcceptThread.<init>(NIOServerCnxnFactory.java:177)
      at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:663)
      at org.apache.zookeeper.server.ServerCnxnFactory.createFactory(ServerCnxnFactory.java:127)
      at org.apache.zookeeper.server.quorum.QuorumPeer.<init>(QuorumPeer.java:709)
      at org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:177)
      at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:113)
      at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:71)
      at org.apache.zookeeper.test.ReconfigTest.setUp(ReconfigTest.java:56)

      1. ZOOKEEPER-1620.patch
        4 kB
        Thawan Kooburat
      2. ZOOKEEPER-1620.patch
        4 kB
        Thawan Kooburat

        Issue Links

          Activity

          Hide
          shrauner Jay Shrauner added a comment -

          Have you tried running the tests in an unmodified branch? I think you are encountering this because of changes in your branch, because in the upstream branch we don't see the tests fail like this.

          That being said, you are right we could do a better job of cleaning up the selector. I'd suggest adding a closeSelector() method to the AbstractSelectThread

          protected void closeSelector() {
          try {
          selector.close();
          catch (IOException e)

          { // Ignore }

          }

          and then adding calls to closeSelector() right before the NIOServerCnxnFactory.this.stop() calls in the finally block of the run methods for the AcceptThread and SelectorThread.

          Show
          shrauner Jay Shrauner added a comment - Have you tried running the tests in an unmodified branch? I think you are encountering this because of changes in your branch, because in the upstream branch we don't see the tests fail like this. That being said, you are right we could do a better job of cleaning up the selector. I'd suggest adding a closeSelector() method to the AbstractSelectThread protected void closeSelector() { try { selector.close(); catch (IOException e) { // Ignore } } and then adding calls to closeSelector() right before the NIOServerCnxnFactory.this.stop() calls in the finally block of the run methods for the AcceptThread and SelectorThread.
          Hide
          shralex Alexander Shraer added a comment -

          Thanks Jay. The tests that fail are related to reconfiguration, but the issue I think is not specific to reconfiguration, that's why I opened it as a separate bug. Let me make the change you suggest in my ZK-107 patch first to see if it solves the issue, and then we can submit it as a separate patch here.

          Show
          shralex Alexander Shraer added a comment - Thanks Jay. The tests that fail are related to reconfiguration, but the issue I think is not specific to reconfiguration, that's why I opened it as a separate bug. Let me make the change you suggest in my ZK-107 patch first to see if it solves the issue, and then we can submit it as a separate patch here.
          Hide
          thawan Thawan Kooburat added a comment -

          Fix fd leakage with unit test (for unix-based system)

          Show
          thawan Thawan Kooburat added a comment - Fix fd leakage with unit test (for unix-based system)
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12565253/ZOOKEEPER-1620.patch
          against trunk revision 1433651.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12565253/ZOOKEEPER-1620.patch against trunk revision 1433651. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. -1 release audit. The applied patch generated 25 release audit warnings (more than the trunk's current 24 warnings). +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1339//console This message is automatically generated.
          Hide
          thawan Thawan Kooburat added a comment -

          Add missing license header

          Show
          thawan Thawan Kooburat added a comment - Add missing license header
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12565259/ZOOKEEPER-1620.patch
          against trunk revision 1433651.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12565259/ZOOKEEPER-1620.patch against trunk revision 1433651. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1340//console This message is automatically generated.
          Hide
          shralex Alexander Shraer added a comment -

          looks good, thanks Thawan

          Show
          shralex Alexander Shraer added a comment - looks good, thanks Thawan
          Hide
          shralex Alexander Shraer added a comment -

          on second thought, this still has a problem.
          when running it on my Mac the test fails with "End fdcount is: 340"

          Show
          shralex Alexander Shraer added a comment - on second thought, this still has a problem. when running it on my Mac the test fails with "End fdcount is: 340"
          Hide
          shralex Alexander Shraer added a comment -

          Thawan, I suggest that you change the last assertion to "(endFdCount - startFdCount) < 2)"
          and in NIOServerCnxnFactory.stop() add thread.closeSelector() right before or right after thread.wakeupSelector()

          Show
          shralex Alexander Shraer added a comment - Thawan, I suggest that you change the last assertion to "(endFdCount - startFdCount) < 2)" and in NIOServerCnxnFactory.stop() add thread.closeSelector() right before or right after thread.wakeupSelector()
          Hide
          shralex Alexander Shraer added a comment -

          actually it would be good to reduce this to < 1, but I haven't found the remaining open selector.

          Show
          shralex Alexander Shraer added a comment - actually it would be good to reduce this to < 1, but I haven't found the remaining open selector.
          Hide
          shrauner Jay Shrauner added a comment -

          You shouldn't need to do that (explicitly call closeSelector inside factory.stop()), and it's preferable if you don't. The reason it's preferable not to do it is to allow the system to do a graceful shutdown. The reason you shouldn't need to do that is that the shutdown call joins on the accept and selector threads, so it's not going to return until those threads exit.

          The unit tests create and destroy the factories/threads/selectors in a pretty tight loop. I'm not sure how long it takes the system to close fd's associated with the selector, but maybe this is something like how closed sockets can linger. I might try to put a lengthier sleep after the factory.shutdown() call.

          There have been known issues with file descriptor leaks when calling selector.close(), so we might check JDK versions we're each running. I found the following bug affecting JDK5u28/6u30/7u5
          http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7118373

          Show
          shrauner Jay Shrauner added a comment - You shouldn't need to do that (explicitly call closeSelector inside factory.stop()), and it's preferable if you don't. The reason it's preferable not to do it is to allow the system to do a graceful shutdown. The reason you shouldn't need to do that is that the shutdown call joins on the accept and selector threads, so it's not going to return until those threads exit. The unit tests create and destroy the factories/threads/selectors in a pretty tight loop. I'm not sure how long it takes the system to close fd's associated with the selector, but maybe this is something like how closed sockets can linger. I might try to put a lengthier sleep after the factory.shutdown() call. There have been known issues with file descriptor leaks when calling selector.close(), so we might check JDK versions we're each running. I found the following bug affecting JDK5u28/6u30/7u5 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7118373
          Hide
          thawan Thawan Kooburat added a comment -

          Thanks Jay, I was about to mention that bug as well. For our environment, we already switched to jdk7u6. I believe we ran into this JDK bug in our production, after upgrading to a newer JDK, the problem is gone.

          Show
          thawan Thawan Kooburat added a comment - Thanks Jay, I was about to mention that bug as well. For our environment, we already switched to jdk7u6. I believe we ran into this JDK bug in our production, after upgrading to a newer JDK, the problem is gone.
          Hide
          shralex Alexander Shraer added a comment -

          Hi Jay,

          and yet with this change my test in ZK-107 now passes. Here's the failure that I was getting yesterday, even after calling closeSelector where you previously suggested:

          https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1342//testReport/org.apache.zookeeper.test/ReconfigTest/testQuorumSystemChange/

          when I add the closeSelector inside factory.stop, this error goes away.

          I still got a failure because endFdCount - startFdCount is 1 on my Mac but 4 on Hudson.

          Show
          shralex Alexander Shraer added a comment - Hi Jay, and yet with this change my test in ZK-107 now passes. Here's the failure that I was getting yesterday, even after calling closeSelector where you previously suggested: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1342//testReport/org.apache.zookeeper.test/ReconfigTest/testQuorumSystemChange/ when I add the closeSelector inside factory.stop, this error goes away. I still got a failure because endFdCount - startFdCount is 1 on my Mac but 4 on Hudson.
          Hide
          shralex Alexander Shraer added a comment -

          sorry, its actually my mistake - I didn't notice the second closeSelector() in Thawan't patch, which has the same effect as what I was suggesting above. so no changes necessary to the patch.

          Show
          shralex Alexander Shraer added a comment - sorry, its actually my mistake - I didn't notice the second closeSelector() in Thawan't patch, which has the same effect as what I was suggesting above. so no changes necessary to the patch.
          Hide
          phunt Patrick Hunt added a comment -

          Committed to trunk. Thanks Thawan.

          Show
          phunt Patrick Hunt added a comment - Committed to trunk. Thanks Thawan.

            People

            • Assignee:
              thawan Thawan Kooburat
              Reporter:
              shralex Alexander Shraer
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development