Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4613

Zookeper client suddenly stops sending ping requests in idle state

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 3.5.7
    • None
    • java client
    • None

    Description

      Expected behavior: from documentation: "The session is kept alive by requests sent by the client. If the session is idle for a period of time that would timeout the session, the client will send a PING request to keep the session alive."
      Actual behavior: periodically client not sending any ping after last successfuly request.

       

      Client operates as expected (handling server notifications and sending pings during idle periods). But after handle another notification it suddenly stops sending any ping (no matter how long idle interval last). Client sleeps until external events is happend (server notification, or program call).
      After wakeup client starts to check session timeouts, detects long pause and triggers connection to state SUSPENDED. Client "closing socket connection and attempting reconnect". If period of inactivity was smaller than the session timeout (session is alive) connection triggers to state RECONNECTED, otherwise new seession created and connection goes to state LOST and then to state RECONNECTED. 

       

      In the attached log:
      1) We have multiple clients working in one JVM.
      2) There is a number of zookeeper servers in a cluster.
      3) Zookeeper client with seession 0x3008266486f2172 connected to 10.36.193.111:2181 with negotiated timeout = 30_000
      4) Client handles last notification at:

       

      15:19:48,874 (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181)) DEBUG org.apache.zookeeper.ClientCnxn$SendThread:923 - Reading reply sessionid:0x3008266486f2172, packet:: clientPath:/tc/default/supervisors/wf_1057 serverPath:/tc/default/supervisors/wf_1057 finished:false header:: 6131,4  replyHeader:: 6131,42964550082,0  request:: '/tc/default/supervisors/wf_1057,T  response:: #***,s{42964550082,42964550082,1663244388854,1663244388854,0,0,0,144258549042716267,897,0,42964550082}

       

      5) Client shows no activity from this moment until watcher notification from server (client wakeups and detects session timeout):

       

      15:20:08,884 INFO  [stdout] (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 WARN  org.apache.zookeeper.ClientCnxn$SendThread:1190 - Client session timed out, have not heard from server in 20010ms for sessionid 0x3008266486f2172
      15:20:08,885 INFO  [stdout] (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1238 - Client session timed out, have not heard from server in 20010ms for sessionid 0x3008266486f2172, closing socket connection and attempting reconnect

       

      6) Ping period = 10 seconds (actualy there were no pings in 20 seconds).
      7) Annother clients (running in the same jvm and connected to the same zk server) operate normaly while broken client do nothing.

       

      15:19:58,885 INFO  [stdout] (mt-[15H15M11S3617]-[module-5]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for sessionid: 0x3008266486f216b after 0ms
      15:19:58,975 INFO  [stdout] (mt-[15H15M11S3617]-[module-4]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for sessionid: 0x3008266486f2171 after 1ms

       

      So we can claim: 

      • there is no long GC pauses on client (we also has GC log for period of incident without any issues)
      • there is no problem with zk server (no reboots etc, cause it successfuly handles another clints requests)    

      We continuously see this behavior after zk version upgrade:
      zk client 3.4.8 ->  3.5.7 (curator 2.12.0 -> 4.3.0)
      zk server upgrade to 3.5.8 (runing with default configuration)

      We have state recovery after session exiration - so connection loss trigger nodes recreation. Another clients react to add/remove events (doing some computations). Having thousands of clients repeadly reconnecting - we get anomaly work load even in idle state...

       

      Attachments

        1. clint-lost-ping.log
          82 kB
          Viktor Feklin

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            tipame Viktor Feklin

            Dates

              Created:
              Updated:

              Slack

                Issue deployment