1. ZooKeeper
  2. ZOOKEEPER-1049

Session expire/close flooding renders heartbeats to delay significantly


    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.2
    • Fix Version/s: 3.3.4, 3.4.0
    • Component/s: server
    • Labels:
    • Environment:

      CentOS 5.3, three node ZK ensemble


      Suppose 100 clients (group A) are connected to a three-node ZK ensemble with a session timeout of 15 seconds, and 1000 clients (group B) are connected to the same ensemble, all watching several nodes (also with a 15-second session timeout).

      Consider a case in which all clients in group B suddenly hang or deadlock (JVM OOME) at the same time. Fifteen seconds later, all sessions in group B expire, creating a session-closing stampede. Depending on the number of clients in group B, every request and response the ZK ensemble has to process is delayed by up to 8 seconds (measured with the 1000 clients we tested).

      This delay causes the sessions of some clients in group A to expire, because their heartbeat responses arrive too late. As a result, healthy servers drop out of their clusters. This is a serious problem in our installation, since some of our services run batch servers or CI servers that create the same scenario almost every day.
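
As a rough illustration of the scale involved, the delay can be modeled as head-of-line blocking on a single FIFO worker. The sketch below is only a back-of-the-envelope model, not ZooKeeper code: the class name `HeadOfLineDelay` and the 8 ms cost per session close are assumptions, the latter chosen only so that 1000 closes add up to the roughly 8 seconds observed.

```java
/** Hypothetical model: one worker drains a FIFO queue, so a ping waits for everything ahead of it. */
final class HeadOfLineDelay {
    /** Wait in milliseconds for a request queued behind {@code queuedAhead} requests. */
    static long waitMillis(int queuedAhead, double serviceMillis) {
        return Math.round(queuedAhead * serviceMillis);
    }

    public static void main(String[] args) {
        // 1000 expiring sessions is from the report; 8 ms per close is an assumed figure.
        System.out.println(waitMillis(1000, 8.0) + " ms");
    }
}
```

Under this model the wait grows linearly with the number of expiring sessions, which is consistent with the delay depending on the size of group B.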

      I am attaching a graph showing ping response time delay.

      I think the ordering between session creation/closing and ping exchange isn't important (quorum state machine). At the very least, ping requests and responses should be handled independently (a different queue and a different thread) so that pings stay real-time.
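
The suggestion above could be sketched roughly as follows. This is not the attached patch and not ZooKeeper's actual request pipeline; `TwoLaneDispatcher` and its `Kind` enum are hypothetical names, and the sketch only shows the idea of giving pings their own queue and thread so that they never wait behind session create/close requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical two-lane dispatcher; each single-thread executor owns one queue and one thread. */
final class TwoLaneDispatcher implements AutoCloseable {
    enum Kind { PING, CREATE_SESSION, CLOSE_SESSION }

    // Ordered lane for session create/close requests.
    private final ExecutorService sessionLane = Executors.newSingleThreadExecutor();
    // Independent lane so heartbeats never queue behind a burst of session closes.
    private final ExecutorService pingLane = Executors.newSingleThreadExecutor();

    /** Routes the work to the lane that matches its kind and returns a handle to its completion. */
    Future<?> submit(Kind kind, Runnable work) {
        ExecutorService lane = (kind == Kind.PING) ? pingLane : sessionLane;
        return lane.submit(work);
    }

    /** Stops both lanes, interrupting any work still running. */
    @Override
    public void close() {
        sessionLane.shutdownNow();
        pingLane.shutdownNow();
    }
}
```

With this split, a stampede of `CLOSE_SESSION` work only backs up the session lane, while a `PING` submitted at the same moment is picked up by the other thread straight away.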

      As a workaround, we are raising the session timeout to 50 seconds.
      But this significantly increases the maximum failover time of the cluster, so the initial QoS we promised cannot be met.
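
One caveat on this workaround, sketched under assumptions: the server clamps the timeout a client requests to the range between `minSessionTimeout` and `maxSessionTimeout`, which default to 2 and 20 times `tickTime`. The helper below shows why a 50-second request may also need a server-side setting. `SessionTimeoutNegotiation` is a hypothetical name, and the `tickTime` of 2000 ms used in the usage note is an assumed value that this report does not state.

```java
/** Hypothetical helper mirroring the default session timeout bounds; not ZooKeeper source code. */
final class SessionTimeoutNegotiation {
    /** Clamps the requested timeout into the default [2 * tickTime, 20 * tickTime] window. */
    static int negotiate(int requestedMillis, int tickTimeMillis) {
        int min = 2 * tickTimeMillis;   // default minSessionTimeout
        int max = 20 * tickTimeMillis;  // default maxSessionTimeout
        return Math.max(min, Math.min(max, requestedMillis));
    }
}
```

For example, `negotiate(50000, 2000)` yields 40000, so with these assumed defaults a 50-second session would only take effect after raising `maxSessionTimeout` or `tickTime` on the servers.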

      1. ZOOKEEPER-1049.patch
        0.7 kB
        Chisu Ryu
      2. ZookeeperPingTest.zip
        19 kB
        Chang Song
      3. zk_ping_latency.pdf
        4 kB
        Chang Song

        Issue Links


          Chang Song created issue -
          Chang Song made changes -
          Field | Original Value | New Value
          Attachment | (none) | zk_ping_latency.pdf [ 12476516 ]
          Chang Song made changes -
          Attachment | (none) | ZookeeperPingTest.zip [ 12476696 ]
          Chisu Ryu made changes -
          Attachment | (none) | ZOOKEEPER-1049.patch [ 12477238 ]
          Benjamin Reed made changes -
          Status | Open [ 1 ] | Patch Available [ 10002 ]
          Mahadev konar made changes -
          Fix Version/s | (none) | 3.4.0 [ 12314469 ]
          Assignee | (none) | Chang Song [ tru64ufs ]
          Mahadev konar made changes -
          Status | Patch Available [ 10002 ] | Resolved [ 5 ]
          Resolution | (none) | Fixed [ 1 ]
          Patrick Hunt made changes -
          Fix Version/s | (none) | 3.3.4 [ 12316276 ]
          Patrick Hunt made changes -
          Link | (none) | This issue relates to ZOOKEEPER-1238 [ ZOOKEEPER-1238 ]
          Mahadev konar made changes -
          Status | Resolved [ 5 ] | Closed [ 6 ]


            • Assignee:
              Chang Song
            • Votes:
              0
            • Watchers:
              4


              • Created: