ZooKeeper / ZOOKEEPER-1049

Session expire/close flooding renders heartbeats to delay significantly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.2
    • Fix Version/s: 3.3.4, 3.4.0
    • Component/s: server
    • Labels:
      None
    • Environment:

      CentOS 5.3, three node ZK ensemble

      Description

      Let's say we have 100 clients (group A) already connected to a three-node ZK ensemble with a session timeout of 15 seconds, and another 1000 clients (group B) connected to the same ensemble, all watching several nodes (also with a 15-second session timeout).

      Now consider the case in which all clients in group B hang or deadlock (e.g. JVM OOME) at the same time. Fifteen seconds later, all sessions in group B expire, creating a session-closing stampede. Depending on the number of clients in group B, every request/response the ZK ensemble has to process gets delayed, by up to 8 seconds with the 1000 clients we tested.

      This delay causes some clients in group A to have their sessions expired because their heartbeat responses arrive late, which in turn makes perfectly healthy servers drop out of their clusters. This is a serious problem in our installation, since some of our services (batch servers, CI servers) create exactly this scenario almost every day.

      I am attaching a graph showing ping response time delay.

      I think the ordering of session create/close and ping exchange isn't important to the quorum state machine; at the very least, ping requests/responses should be handled independently (separate queue and separate thread) to preserve the real-time behaviour of pings.

      As a workaround, we are raising the session timeout to 50 seconds.
      But this significantly increases the maximum failover time of the cluster, so the initial QoS we promised can no longer be met.
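
      The workaround itself is nothing more than the session-timeout argument passed when the client handle is created. A minimal client-side sketch (the connect string below is a placeholder, and the requested 50 seconds is only honoured if it falls within the server's minSessionTimeout/maxSessionTimeout bounds):

      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      public class SessionTimeoutWorkaround {
          public static void main(String[] args) throws Exception {
              // Placeholder ensemble address; with tickTime=2000 the default
              // maxSessionTimeout is 40s, so the server may negotiate this down
              // unless maxSessionTimeout is raised.
              ZooKeeper zk = new ZooKeeper("zk1:17288,zk2:17288,zk3:17288", 50000,
                      new Watcher() {
                          public void process(WatchedEvent event) {
                              System.out.println("event: " + event);
                          }
                      });
              System.out.println("negotiated timeout: " + zk.getSessionTimeout() + " ms");
              zk.close();
          }
      }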

      1. zk_ping_latency.pdf
        4 kB
        Chang Song
      2. ZOOKEEPER-1049.patch
        0.7 kB
        Chisu Ryu
      3. ZookeeperPingTest.zip
        19 kB
        Chang Song


          Activity

          Chang Song added a comment -

          Ping delay with respect to the number of clients closing sessions.
          Chang Song added a comment -

          We have monitored the ZK ensemble servers. There is no significant I/O activity; this workload is CPU-bound. We see 0-3% iowait (normal system I/O from log flush and the page-cache flush daemon), and user CPU goes up to 30-40% during this closing stampede.

          We'll attach the ZK config file.
          Chang Song added a comment -

          Log4J log level is "error" during this test.
          Chang Song added a comment -

          ZK version is 3.3.3
          Chang Song added a comment -

          tickTime=2000
          initLimit=10
          syncLimit=5

          dataDir=/???/zookeeper/data
          dataLogDir=/???/zookeeper/log

          clientPort=17288

          maxClientCnxns=0

          #globalOutstandingLimit=1000, we tried both (no difference)
          globalOutstandingLimit=50000

          # Arcus ZooKeeper Servers
          server.1=????
          server.2=????
          server.3=1????
          Chang Song added a comment -

          log4j.rootLogger=ERROR, ROLLINGFILE

          log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
          log4j.appender.CONSOLE.Threshold=INFO
          log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
          log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n

          log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
          log4j.appender.ROLLINGFILE.Threshold=DEBUG
          log4j.appender.ROLLINGFILE.File=/???/zookeeper/log/zookeeper.log
          log4j.appender.ROLLINGFILE.MaxFileSize=50MB
          log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
          log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n

          log4j.appender.TRACEFILE=org.apache.log4j.FileAppender
          log4j.appender.TRACEFILE.Threshold=TRACE
          log4j.appender.TRACEFILE.File=/???/zookeeper/log/zookeeper_trace.log
          log4j.appender.TRACEFILE.layout=org.apache.log4j.PatternLayout
          log4j.appender.TRACEFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L][%x] - %m%n
          Chang Song added a comment -

          1) are you running in a virtualized environment?

          NO

          2) are you co-locating other services on the same host(s) that make up
          the ZK serving cluster?

          NO. Dedicated ZK ensemble

          3) have you followed the admin guide's "things to avoid"?
          http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
          In particular ensuring that you are not swapping or going into gc
          pause (both on the server and the client)
          a) try turning on GC logging and ensure that you are not going into GC
          pause, see the troubleshooting guide, this is the most common cause of
          high latency for the clients

          No full GC (jstat output)

          b) ensure that you are not swapping

          No swapping. No significant IO (0-3% iowait utilization)

          c) ensure that other processes are not causing log writing
          (transactional logging) to be slow.

          No other processes are running on these hosts.
          Chang Song added a comment -

          1) what is the connectivity like btw the servers?
          What is the ping time btw them?

          ping time is under 1ms (inter-IDC)

          Is the system perhaps loading down the network during this test,
          causing network latency to increase? Are all the nic cards (server and
          client) configured correctly? I've seen a number of cases where
          clients and/or server had incorrectly configured nics (ethtool
          reported 10 MB/sec half duplex for what should be 1gigeth)

          They all are configured Full-Duplex 1Gbps NIC.
          Switches are all operational.

          2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
          issue is happening, what's the %util of the disk? what's the iowait
          look like?

          0-3% at the time of testing. It varies. There is no significant IO
          from Zookeeper process. Just normal page cache flush activity.
          Chang Song added a comment -

          How to reproduce:

          1. Connect 50 clients to the ZK ensemble, each watching a node with a 15-second session timeout (call this group A).

          2. Connect 1000 clients in another group (group B) to the same ensemble, then hang their JVMs (or intentionally create an OOM condition). Again with a 15-second session timeout.

          3. Watch clients in group A drop out of the cluster due to ping delay (session expired).
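
          The group A side of this can be as simple as a client that times each read against the watched node and reports spikes. A rough sketch (the host list and znode path are placeholders; this is not the attached reproducer):

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.ZooKeeper;

          public class PingDelayProbe {
              public static void main(String[] args) throws Exception {
                  // Placeholder ensemble address; 15 second session timeout as in the test.
                  ZooKeeper zk = new ZooKeeper("zk1:17288,zk2:17288,zk3:17288", 15000,
                          new Watcher() {
                              public void process(WatchedEvent event) { }
                          });
                  while (true) {
                      long start = System.currentTimeMillis();
                      zk.exists("/watched-node", true);   // one round trip to the server
                      long elapsed = System.currentTimeMillis() - start;
                      if (elapsed > 1000) {
                          System.out.println("slow response: " + elapsed + " ms");
                      }
                      Thread.sleep(1000);
                  }
              }
          }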
          Chang Song added a comment -

          What about using multiple ZK clusters for this, then?

          Sorry, we cannot do that due to our service deployment policy.

          Have you tested a cluster where the machines are set up correctly with separate snapshot and log disks?

          We have two test/staging ZK ensembles set up, plus one production ensemble. They all reproduce the same result.
          Since IO wasn't an issue, there is no point in separating snapshot and log.

          Are your ZK machines doing any other tasks?

          This is a dedicated ZK ensemble.
          Camille Fournier added a comment -

          This might be related to this email chain from February:

          http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E

          I theorized at the time that the issue might be due to synchronization on the session table, but never got enough information to finish the investigation.
          Chang Song added a comment -

          We have uploaded simple reproducer scripts.
          Chang Song added a comment -

          I have uploaded a python-based reproducer and test results.

          This is probably the worst iowait CPU% (you can find the details in the Result directory):

          avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                     6.15    0.00    2.05    2.56    0.00   89.23

          The resulting latency delay (with this reproducer): you will find a 2-second delay for some clients.

          close session elapsed : 0ms
          close session elapsed : 0ms
          close session elapsed : 2001ms
          close session elapsed : 1ms
          close session elapsed : 0ms

          .....

          close session elapsed : 0ms
          close session elapsed : 0ms
          close session elapsed : 2001ms
          close session elapsed : 0ms
          close session elapsed : 0ms

          ....

          close session elapsed : 0ms
          close session elapsed : 2000ms
          close session elapsed : 2000ms
          close session elapsed : 0ms
          close session elapsed : 0ms
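
          For reference, the numbers above are just wall-clock time around the close call, roughly like the following (a Java equivalent; the attached reproducer itself is Python-based):

          import org.apache.zookeeper.ZooKeeper;

          public final class TimedClose {
              // Times how long the session close takes from the client's point of view.
              static long timedClose(ZooKeeper zk) throws InterruptedException {
                  long start = System.currentTimeMillis();
                  zk.close();
                  long elapsed = System.currentTimeMillis() - start;
                  System.out.println("close session elapsed : " + elapsed + "ms");
                  return elapsed;
              }
          }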
          Chang Song added a comment -

          When this happens, the thread below becomes BLOCKED:

          "NIOServerCxn.Factory:0.0.0.0/0.0.0.0:17288" - Thread t@11
          java.lang.Thread.State: BLOCKED on org.apache.zookeeper.server.NIOServerCnxn$Factory@399831c3 owned by: CommitProcessor:3
          at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:234)
          Locked ownable synchronizers:
          • None

          If you look at CommitProcessor:3:

          "CommitProcessor:3" - Thread t@27
          java.lang.Thread.State: RUNNABLE
          at sun.nio.ch.FileDispatcher.preClose0(Native Method)
          at sun.nio.ch.SocketDispatcher.preClose(SocketDispatcher.java:41)
          at sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:684)
          • locked java.lang.Object@3186c91
            at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:201)
            at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:97)
          • locked java.lang.Object@7fc600
            at sun.nio.ch.SocketAdaptor.close(SocketAdaptor.java:352)
            at org.apache.zookeeper.server.NIOServerCnxn.closeSock(NIOServerCnxn.java:1463)
            at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:1412)
          • locked java.util.HashSet@4b94a39
            at org.apache.zookeeper.server.NIOServerCnxn$Factory.closeSessionWithoutWakeup(NIOServerCnxn.java:343)
            at org.apache.zookeeper.server.NIOServerCnxn$Factory.closeSession(NIOServerCnxn.java:330)
          • locked org.apache.zookeeper.server.NIOServerCnxn$Factory@399831c
            at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:133)
            at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:540)
            at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
          Locked ownable synchronizers:
          • None

          It is inside scxn.closeSession(request.sessionId), called from FinalRequestProcessor.processRequest():

          if (request.hdr != null && request.hdr.getType() == OpCode.closeSession) {
              Factory scxn = zks.getServerCnxnFactory();
              // this might be possible since
              // we might just be playing diffs from the leader
              if (scxn != null && request.cnxn == null) {
                  // calling this if we have the cnxn results in the client's
                  // close session response being lost - we've already closed
                  // the session/socket here before we can send the closeSession
                  // in the switch block below
                  scxn.closeSession(request.sessionId);
                  return;
              }
          }

          Below is a synchronized method, NIOServerCnxn.Factory.closeSession():

          synchronized void closeSession(long sessionId) {
              selector.wakeup();
              closeSessionWithoutWakeup(sessionId);
          }

          @SuppressWarnings("unchecked")
          private void closeSessionWithoutWakeup(long sessionId) {
              HashSet<NIOServerCnxn> cnxns;
              synchronized (this.cnxns) {
                  cnxns = (HashSet<NIOServerCnxn>) this.cnxns.clone();
              }

              for (NIOServerCnxn cnxn : cnxns) {
                  if (cnxn.sessionId == sessionId) {
                      try {
                          cnxn.close();
                      } catch (Exception e) {
                          LOG.warn("exception during session close", e);
                      }
                      break;
                  }
              }
          }

          We measured the time spent in the scxn.closeSession(request.sessionId) call shown above; the result is in the previous post.
          Chang Song added a comment -

          I think we found the culprit:

          public NIOServerCnxn(ZooKeeperServer zk, SocketChannel sock,
                  SelectionKey sk, Factory factory) throws IOException {
              this.zk = zk;
              this.sock = sock;
              this.sk = sk;
              this.factory = factory;
              sock.socket().setTcpNoDelay(true);
              sock.socket().setSoLinger(true, 2); // socket linger option of 2 seconds
              InetAddress addr = ((InetSocketAddress) sock.socket()
                      .getRemoteSocketAddress()).getAddress();
              authInfo.add(new Id("ip", addr.getHostAddress()));
              sk.interestOps(SelectionKey.OP_READ);
          }

          It is the socket linger option. Since the clients' sessions have already expired, do we really need to wait 2 seconds for them to consume the socket receive buffer?

          We set the linger option to 0 to send a TCP reset, and the problem went away.

          We'll try to test the original case and report back.
          Chang Song added a comment -

          We just verified that the original problem goes away when the linger option is off (0).
          Mystery solved.

          We will patch our production servers right away.
          Mahadev konar added a comment -

          Chang,
          I am quite confused. I'm not sure how setting the TCP linger time to 0 is helping you. What OS are you using? I find it surprising that setting the linger time to 0 doesn't delay packets to other sockets.
          Mahadev konar added a comment -

          Ben and I looked at the man page of socket, and here is what it says:

          SO_LINGER
          Sets or gets the SO_LINGER option. The argument is a linger structure.

          struct linger {
              int l_onoff;    /* linger active */
              int l_linger;   /* how many seconds to linger for */
          };

          When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.

          So, since we have a single thread that calls close() and does IO on sockets, it is possible that this can cause a huge delay in IO for other sockets. If anyone else has more information on this, please feel free to update!
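
          For anyone mapping the man page back to the Java API, the three configurations look roughly like this on a java.net.Socket (illustrative only; the values mirror the cases discussed in this issue):

          import java.net.Socket;
          import java.net.SocketException;

          public class LingerModes {
              public static void main(String[] args) throws SocketException {
                  Socket s = new Socket();   // unconnected; options can be set before connecting

                  // SO_LINGER disabled: close() returns immediately and the kernel
                  // delivers any queued data in the background (graceful FIN).
                  s.setSoLinger(false, -1);

                  // SO_LINGER enabled with a 2 second timeout (current NIOServerCnxn
                  // behaviour): close() can block for up to 2 seconds while queued
                  // data is still being sent to the peer.
                  s.setSoLinger(true, 2);

                  // SO_LINGER enabled with timeout 0: queued data is discarded and a
                  // TCP RST is sent; the connection is torn down immediately.
                  s.setSoLinger(true, 0);
              }
          }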
          Ted Dunning added a comment -

          Is it possible that somebody on the server side has a lock while trying to close a socket to a failed machine?

          Or is it possible that there is a limited number of threads available for this closing activity?
          Chang Song added a comment -

          OK.

          A linger option with a 2-second timeout works on the Apache web server because it uses a connection-per-process/thread model: blocking in one process does not block other request processing.

          If you look at the code below, I don't see why it shouldn't block. One 2-second delay pushes some other closing activity out to 4, 6 or 8 seconds, and so forth.

          NIOServerCnxn.Factory.closeSession():

          synchronized void closeSession(long sessionId) {
              selector.wakeup();
              closeSessionWithoutWakeup(sessionId);
          }

          @SuppressWarnings("unchecked")
          private void closeSessionWithoutWakeup(long sessionId) {
              HashSet<NIOServerCnxn> cnxns;
              synchronized (this.cnxns) {
                  cnxns = (HashSet<NIOServerCnxn>) this.cnxns.clone();
              }

              for (NIOServerCnxn cnxn : cnxns) {
                  if (cnxn.sessionId == sessionId) {
                      try {
                          cnxn.close();
                      } catch (Exception e) {
                          LOG.warn("exception during session close", e);
                      }
                      break;
                  }
              }
          }

          One correction I need to make: my team didn't set linger to 0. They disabled the linger option, so close() returns immediately (lingering is done by the TCP stack).

          This is different from setting linger to 0, which simply flushes all the buffers and sends a TCP RST. By disabling the linger option, the system will hold on to the socket in FIN_WAIT_1 state for dead clients; note that we have no timeout for FIN_WAIT_1 sockets.

          We'll have to experiment with setting the linger option to 0 to move sockets from ESTABLISHED to CLOSED immediately.
          Chang Song added a comment -

          I don't see any problem with disabling the linger option, due to net.ipv4.tcp_orphan_retries being about 200ms on Linux: a socket to a dead peer that is closed without the linger option goes to FIN_WAIT_1, times out almost immediately (after about 200ms), and moves to CLOSED.

          So our final fix is:

          setSoLinger(false, -1)

          Please consider this fix for the next release.
          Thank you.
          Chisu Ryu added a comment -

          This patch sets the NIOServerCnxn linger option to 'false, -1'.
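
          Applied to the constructor quoted earlier, the change amounts to a single line; paraphrasing the attached ZOOKEEPER-1049.patch (see the patch for the actual diff):

          public NIOServerCnxn(ZooKeeperServer zk, SocketChannel sock,
                  SelectionKey sk, Factory factory) throws IOException {
              this.zk = zk;
              this.sock = sock;
              this.sk = sk;
              this.factory = factory;
              sock.socket().setTcpNoDelay(true);
              // Disable SO_LINGER so that close() returns immediately instead of
              // blocking the single NIO/commit thread for up to 2 seconds per
              // expired session.
              sock.socket().setSoLinger(false, -1);
              InetAddress addr = ((InetSocketAddress) sock.socket()
                      .getRemoteSocketAddress()).getAddress();
              authInfo.add(new Id("ip", addr.getHostAddress()));
              sk.interestOps(SelectionKey.OP_READ);
          }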
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477238/ZOOKEEPER-1049.patch
          against trunk revision 1091841.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/235//console

          This message is automatically generated.
          Mahadev konar added a comment -

          I just committed this! Very nice catch! thanks chang!
          Mahadev konar added a comment -

          BTW, I added a tiny comment with the patch so that it's clear for others.
          Chang Song added a comment -

          Kudos to "Chisu Ryu", who found the root cause and came up with the fix.
          Thank you.
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1172 (See https://builds.apache.org/hudson/job/ZooKeeper-trunk/1172/)
          ZOOKEEPER-1049. Session expire/close flooding renders heartbeats to delay significantly. (chang song via mahadev)
          Patrick Hunt added a comment -

          Recently a couple of users have reported this - I'm including it in 3.3.4 as well.

            People

            • Assignee:
              Chang Song
            • Reporter:
              Chang Song
            • Votes:
              0
            • Watchers:
              4
