ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-517

NIO factory fails to close connections when the number of file handles run out.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Duplicate
    • Affects Version/s: 3.4.3, 3.5.0
    • Fix Version/s: None
    • Component/s: server
    • Labels:
      None
    • Release Note:
      Fix NIO accept loop error handling

      Description

      The code in NIO factory is such that if we fail to accept a connection due to some reasons (too many file handles maybe one of them) we do not close the connections that are in CLOSE_WAIT. We need to call an explicit close on these sockets and then close them. One of the solutions might be to move doIO before accpet so that we can still close connection even if we cannot accept connections.

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          would be great to get this into 3.3.2

          Show
          Patrick Hunt added a comment - would be great to get this into 3.3.2
          Hide
          Patrick Hunt added a comment -

          I noticed another issue in this same area:

                                  if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) {
                                      SocketChannel sc = ((ServerSocketChannel) k
                                              .channel()).accept();
                                      InetAddress ia = sc.socket().getInetAddress();
                                      int cnxncount = getClientCnxnCount(ia);
                                      if (maxClientCnxns > 0 && cnxncount >= maxClientCnxns){
                                          LOG.warn("Too many connections from " + ia
                                                   + " - max is " + maxClientCnxns );
                                          sc.close();
          

          Notice that if macclientcnxns is exceeded we close the connection - iirc default linger time on linux is typically 60seconds. So if we have a mis-behaving client we can still run into trouble ("still" meaning we added this code to keep clients from DOS attacking the server) with TIME_WAIT state on the tcp connection.

          We need to set the linger time down to 0 in this case before closing.

          Show
          Patrick Hunt added a comment - I noticed another issue in this same area: if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) { SocketChannel sc = ((ServerSocketChannel) k .channel()).accept(); InetAddress ia = sc.socket().getInetAddress(); int cnxncount = getClientCnxnCount(ia); if (maxClientCnxns > 0 && cnxncount >= maxClientCnxns){ LOG.warn("Too many connections from " + ia + " - max is " + maxClientCnxns ); sc.close(); Notice that if macclientcnxns is exceeded we close the connection - iirc default linger time on linux is typically 60seconds. So if we have a mis-behaving client we can still run into trouble ("still" meaning we added this code to keep clients from DOS attacking the server) with TIME_WAIT state on the tcp connection. We need to set the linger time down to 0 in this case before closing.
          Hide
          Mahadev konar added a comment -

          moving this out to 3.3.3 and 3.4 for investigation.

          Show
          Mahadev konar added a comment - moving this out to 3.3.3 and 3.4 for investigation.
          Hide
          Mahadev konar added a comment -

          not a blocker. Moving it out of 3.4 release.

          Show
          Mahadev konar added a comment - not a blocker. Moving it out of 3.4 release.
          Hide
          Jay Shrauner added a comment -
          • Fixed accept() code to trap errors so the select handler loop doesn't skip handling any other IO requests (such as, eg, closing a socket)
          • Added rate limiting to error logging. Noticed when stressing the server with too many connections that the server was pegging on hammering the error logs
          • Reproduced issue Patrick noted and fixed closing of too many connects from single client so they wouldn't park in CLOSE_WAIT
          Show
          Jay Shrauner added a comment - Fixed accept() code to trap errors so the select handler loop doesn't skip handling any other IO requests (such as, eg, closing a socket) Added rate limiting to error logging. Noticed when stressing the server with too many connections that the server was pegging on hammering the error logs Reproduced issue Patrick noted and fixed closing of too many connects from single client so they wouldn't park in CLOSE_WAIT
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522030/ZOOKEEPER-517.patch
          against trunk revision 1307644.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12522030/ZOOKEEPER-517.patch against trunk revision 1307644. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//console This message is automatically generated.
          Hide
          Jay Shrauner added a comment -

          How tested:

          • set file descriptor limit low on server, started server
          • opened connections until no longer able to establish session
          • opened additional connections to set up large pool of pending accepts
          • closed connected session
          • verified pending accepts made it through

          For max connections per client testing, set a low value and kept opening more than that # of connections from the same IP, verified that no longer see sockets parked in CLOSE_WAIT.

          Show
          Jay Shrauner added a comment - How tested: set file descriptor limit low on server, started server opened connections until no longer able to establish session opened additional connections to set up large pool of pending accepts closed connected session verified pending accepts made it through For max connections per client testing, set a low value and kept opening more than that # of connections from the same IP, verified that no longer see sockets parked in CLOSE_WAIT.
          Hide
          Jay Shrauner added a comment -

          Superseded and made obsolete by ZOOKEEPER-1504

          Show
          Jay Shrauner added a comment - Superseded and made obsolete by ZOOKEEPER-1504

            People

            • Assignee:
              Jay Shrauner
              Reporter:
              Mahadev konar
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development