ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-517

NIO factory fails to close connections when the number of file handles run out.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Duplicate
    • Affects Version/s: 3.4.3, 3.5.0
    • Fix Version/s: None
    • Component/s: server
    • Labels:
      None
    • Release Note:
      Fix NIO accept loop error handling

      Description

      The code in NIO factory is such that if we fail to accept a connection due to some reasons (too many file handles maybe one of them) we do not close the connections that are in CLOSE_WAIT. We need to call an explicit close on these sockets and then close them. One of the solutions might be to move doIO before accpet so that we can still close connection even if we cannot accept connections.

        Issue Links

          Activity

          Mahadev konar created issue -
          Patrick Hunt made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Critical [ 2 ]
          Mahadev konar made changes -
          Fix Version/s 3.4.0 [ 12314469 ]
          Fix Version/s 3.3.0 [ 12313976 ]
          Hide
          Patrick Hunt added a comment -

          would be great to get this into 3.3.2

          Show
          Patrick Hunt added a comment - would be great to get this into 3.3.2
          Patrick Hunt made changes -
          Fix Version/s 3.3.2 [ 12315108 ]
          Component/s server [ 12312382 ]
          Hide
          Patrick Hunt added a comment -

          I noticed another issue in this same area:

                                  if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) {
                                      SocketChannel sc = ((ServerSocketChannel) k
                                              .channel()).accept();
                                      InetAddress ia = sc.socket().getInetAddress();
                                      int cnxncount = getClientCnxnCount(ia);
                                      if (maxClientCnxns > 0 && cnxncount >= maxClientCnxns){
                                          LOG.warn("Too many connections from " + ia
                                                   + " - max is " + maxClientCnxns );
                                          sc.close();
          

          Notice that if macclientcnxns is exceeded we close the connection - iirc default linger time on linux is typically 60seconds. So if we have a mis-behaving client we can still run into trouble ("still" meaning we added this code to keep clients from DOS attacking the server) with TIME_WAIT state on the tcp connection.

          We need to set the linger time down to 0 in this case before closing.

          Show
          Patrick Hunt added a comment - I noticed another issue in this same area: if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) { SocketChannel sc = ((ServerSocketChannel) k .channel()).accept(); InetAddress ia = sc.socket().getInetAddress(); int cnxncount = getClientCnxnCount(ia); if (maxClientCnxns > 0 && cnxncount >= maxClientCnxns){ LOG.warn("Too many connections from " + ia + " - max is " + maxClientCnxns ); sc.close(); Notice that if macclientcnxns is exceeded we close the connection - iirc default linger time on linux is typically 60seconds. So if we have a mis-behaving client we can still run into trouble ("still" meaning we added this code to keep clients from DOS attacking the server) with TIME_WAIT state on the tcp connection. We need to set the linger time down to 0 in this case before closing.
          Patrick Hunt made changes -
          Link This issue relates to ZOOKEEPER-801 [ ZOOKEEPER-801 ]
          Mahadev konar made changes -
          Assignee Benjamin Reed [ breed ]
          Hide
          Mahadev konar added a comment -

          moving this out to 3.3.3 and 3.4 for investigation.

          Show
          Mahadev konar added a comment - moving this out to 3.3.3 and 3.4 for investigation.
          Mahadev konar made changes -
          Fix Version/s 3.3.3 [ 12315482 ]
          Fix Version/s 3.3.2 [ 12315108 ]
          Benjamin Reed made changes -
          Fix Version/s 3.3.3 [ 12315482 ]
          Hide
          Mahadev konar added a comment -

          not a blocker. Moving it out of 3.4 release.

          Show
          Mahadev konar added a comment - not a blocker. Moving it out of 3.4 release.
          Mahadev konar made changes -
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.4.0 [ 12314469 ]
          Jay Shrauner made changes -
          Attachment ZOOKEEPER-517.patch [ 12522030 ]
          Hide
          Jay Shrauner added a comment -
          • Fixed accept() code to trap errors so the select handler loop doesn't skip handling any other IO requests (such as, eg, closing a socket)
          • Added rate limiting to error logging. Noticed when stressing the server with too many connections that the server was pegging on hammering the error logs
          • Reproduced issue Patrick noted and fixed closing of too many connects from single client so they wouldn't park in CLOSE_WAIT
          Show
          Jay Shrauner added a comment - Fixed accept() code to trap errors so the select handler loop doesn't skip handling any other IO requests (such as, eg, closing a socket) Added rate limiting to error logging. Noticed when stressing the server with too many connections that the server was pegging on hammering the error logs Reproduced issue Patrick noted and fixed closing of too many connects from single client so they wouldn't park in CLOSE_WAIT
          Jay Shrauner made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Fix NIO accept loop error handling
          Affects Version/s 3.4.3 [ 12319288 ]
          Affects Version/s 3.5.0 [ 12316644 ]
          Assignee Benjamin Reed [ breed ] Jay Shrauner [ shrauner ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12522030/ZOOKEEPER-517.patch
          against trunk revision 1307644.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12522030/ZOOKEEPER-517.patch against trunk revision 1307644. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1028//console This message is automatically generated.
          Hide
          Jay Shrauner added a comment -

          How tested:

          • set file descriptor limit low on server, started server
          • opened connections until no longer able to establish session
          • opened additional connections to set up large pool of pending accepts
          • closed connected session
          • verified pending accepts made it through

          For max connections per client testing, set a low value and kept opening more than that # of connections from the same IP, verified that no longer see sockets parked in CLOSE_WAIT.

          Show
          Jay Shrauner added a comment - How tested: set file descriptor limit low on server, started server opened connections until no longer able to establish session opened additional connections to set up large pool of pending accepts closed connected session verified pending accepts made it through For max connections per client testing, set a low value and kept opening more than that # of connections from the same IP, verified that no longer see sockets parked in CLOSE_WAIT.
          Hide
          Jay Shrauner added a comment -

          Superseded and made obsolete by ZOOKEEPER-1504

          Show
          Jay Shrauner added a comment - Superseded and made obsolete by ZOOKEEPER-1504
          Patrick Hunt made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 3.5.0 [ 12316644 ]
          Resolution Duplicate [ 3 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          957d 23h 33m 1 Jay Shrauner 10/Apr/12 00:05
          Patch Available Patch Available Resolved Resolved
          108d 6h 12m 1 Patrick Hunt 27/Jul/12 06:18

            People

            • Assignee:
              Jay Shrauner
              Reporter:
              Mahadev konar
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development