ZooKeeper / ZOOKEEPER-4288

zkServer 'ruok' responds 'imok' but any request (including zkCli.sh) fails with 'client session timed out'


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.5.5
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

    Description

      I'm running ZooKeeper from the 3.5.5 docker image in k8s. Three pods are located on three nodes and communicate with each other through k8s ClusterIP services. I did the following: shut down node 2 for 3 hours and then rejoined it to the cluster. After that, ZooKeeper could not respond to any requests (including zkCli.sh). I used the ruok and mntr commands to check the servers and found only one anomaly: outstanding_requests is abnormally high and keeps growing (I have a Kafka cluster using this ensemble).
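
      To confirm that the queue really keeps growing rather than spiking once, a minimal sketch of how it can be sampled over time (assumes nc is available and the three service names resolve from wherever the script runs):

      #!/bin/sh
      # Sample zk_outstanding_requests on each server every 10 seconds (sketch).
      while true; do
        for host in itoa-zookeeper-service1 itoa-zookeeper-service2 itoa-zookeeper-service3; do
          val=$(echo mntr | nc -w 2 "$host" 2181 | awk '/^zk_outstanding_requests/ {print $2}')
          echo "$(date '+%F %T') $host zk_outstanding_requests=$val"
        done
        sleep 10
      done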

       

      my cluster:

       

      //my zookeeper pods
      [root@node1 ~]# kubectl get po -nsa -o wide | grep zookeeper-dp
      itoa-zookeeper-dp1-5bc799c97f-t7wpz                        1/1     Running            0          17h   177.177.166.244   node1   <none>           <none>
      itoa-zookeeper-dp2-6d89756876-k2ds2                        1/1     Running            0          13h   177.177.104.35    node2   <none>           <none>
      itoa-zookeeper-dp3-b55ddd6c7-8h5p7                         1/1     Running            0          16h   177.177.135.54    node3   <none>           <none>
      
      //my zookeeper cluster svc
      [root@node1 ~]# kubectl get svc -nsa -o wide | grep zookeeper-dp
      itoa-zookeeper-service1                          ClusterIP   10.96.235.75    <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp1
      itoa-zookeeper-service2                          ClusterIP   10.96.161.122   <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp2
      itoa-zookeeper-service3                          ClusterIP   10.96.67.1      <none>        2181/TCP,2888/TCP,3888/TCP                                                                          44h   node=itoa-zookeeper-dp3

       

       

      my cluster conf:

       

      [root@node1 ~]# echo conf | nc 177.177.166.244 2181
      clientPort=2181
      secureClientPort=-1
      dataDir=/opt/zookeeper/data/version-2
      dataDirSize=135540411
      dataLogDir=/opt/zookeeper/data/version-2
      dataLogSize=135540411
      tickTime=2000
      maxClientCnxns=60
      minSessionTimeout=4000
      maxSessionTimeout=40000
      serverId=1
      initLimit=10
      syncLimit=5
      electionAlg=3
      electionPort=3888
      quorumPort=2888
      peerType=0
      membership:
      server.1=0.0.0.0:2888:3888:participant
      server.2=itoa-zookeeper-service2:2888:3888:participant
      server.3=itoa-zookeeper-service3:2888:3888:participant
      version=0
      

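      As a sanity check on membership, the same runtime conf can be pulled from every server and compared; a minimal sketch, assuming the 4lw commands are reachable on each pod IP:

      # Compare serverId and the membership lines across the ensemble (sketch).
      for host in 177.177.166.244 177.177.104.35 177.177.135.54; do
        echo "== $host =="
        echo conf | nc -w 2 "$host" 2181 | grep -E '^(serverId|server\.[0-9])'
      done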
       

      ruok & mntr responses

       

      //177.177.166.244
      
      [root@node1 ~]# echo ruok | nc 177.177.166.244 2181 && echo mntr | nc 177.177.166.244 2181
      imokzk_version  3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
      zk_avg_latency  2
      zk_max_latency  2
      zk_min_latency  2
      zk_packets_received     740
      zk_packets_sent 105
      zk_num_alive_connections        1
      zk_outstanding_requests 634
      zk_server_state follower
      zk_znode_count  8092
      zk_watch_count  0
      zk_ephemerals_count     5
      zk_approximate_data_size        708815
      zk_open_file_descriptor_count   100
      zk_max_file_descriptor_count    1048576
      
      //177.177.135.54
      [root@node1 ~]# echo ruok | nc 177.177.135.54 2181 && echo mntr | nc 177.177.135.54 2181
      imokzk_version  3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
      zk_avg_latency  0
      zk_max_latency  0
      zk_min_latency  0
      zk_packets_received     727
      zk_packets_sent 107
      zk_num_alive_connections        1
      zk_outstanding_requests 619
      zk_server_state leader
      zk_znode_count  8092
      zk_watch_count  0
      zk_ephemerals_count     5
      zk_approximate_data_size        708815
      zk_open_file_descriptor_count   101
      zk_max_file_descriptor_count    1048576
      zk_followers    1
      zk_synced_followers     1
      zk_pending_syncs        0
      zk_last_proposal_size   32
      zk_max_proposal_size    162
      zk_min_proposal_size    32
      
      
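      Note that ruok and mntr are answered in the connection layer and do not exercise the request pipeline, which is why they can look healthy while every real request hangs. A probe that performs an actual read, as a sketch (assumes zkCli.sh is on the PATH inside the image and that timeout is available):

      # One-shot read through the normal request path, bounded by a timeout (sketch).
      kubectl exec -nsa itoa-zookeeper-dp1-5bc799c97f-t7wpz -- \
        timeout 15 zkCli.sh -server localhost:2181 ls / \
        || echo "read did not complete"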

       

      zkCli.sh responds 'client session timed out'

       

      [zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 0] ls /
      2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - WARN  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0
      2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - INFO  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0, closing socket connection and attempting reconnect
      KeeperErrorCode = ConnectionLoss for /
      [zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 1] 2021-04-29 11:08:22,743 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service2/10.96.161.122:2181. Will not attempt to authenticate using SASL (unknown error)
      2021-04-29 11:08:22,744 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:45456, server: itoa-zookeeper-service2/10.96.161.122:2181
      2021-04-29 11:08:22,747 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1240] - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
      2021-04-29 11:08:23,761 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service1/10.96.235.75:2181. Will not attempt to authenticate using SASL (unknown error)
      2021-04-29 11:08:23,762 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:42428, server: itoa-zookeeper-service1/10.96.235.75:2181
      

       
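      The sessionid in these logs is still 0x0, so it is the session-creation request itself that never gets a reply even though the socket connects. The per-connection queue counters can be checked with the cons four-letter word; a sketch, assuming cons is whitelisted like ruok/mntr/conf evidently are:

      # Per-connection recved/sent/queued counters on the leader (sketch).
      echo cons | nc -w 2 177.177.135.54 2181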

      I shut down node2 at 2021-04-28 17:12 and brought it back up at 20:40. My pod logs are in the attachments (due to Jira's size limit, I split the log of pod2).

      Other information I collected:

      • The TCP connection information of one of the pods (no abnormalities in any of the 3 pods; a kubectl exec alternative is sketched after the output):

       

      [root@node1 ~]# nsenter -n -t $(docker inspect -f '{{.State.Pid}}' $(docker ps | grep zookeeper-dp1 | grep -v pause | awk '{print $1}')) netstat -plan
      Active Internet connections (servers and established)
      Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
      tcp 0 0 0.0.0.0:35883 0.0.0.0:* LISTEN 109187/java
      tcp 0 0 0.0.0.0:3888 0.0.0.0:* LISTEN 109187/java
      tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 109187/java
      tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN 109187/java
      tcp 0 0 177.177.166.244:3888 177.177.104.35:57886 ESTABLISHED 109187/java
      tcp 0 0 177.177.166.244:42204 10.96.67.1:2888 ESTABLISHED 109187/java
      tcp 0 0 177.177.166.244:3888 177.177.135.54:44462 ESTABLISHED 109187/java
      Active UNIX domain sockets (servers and established)
      Proto RefCnt Flags Type State I-Node PID/Program name Path
      unix 2 [ ACC ] STREAM LISTENING 2342381971 109187/java /tmp/.java_pid1.tmp
      unix 2 [ ] STREAM CONNECTED 2243039212 109187/java
      unix 2 [ ] STREAM CONNECTED 2243034965 109187/java
      

       
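      The same listing can be taken from inside the pod with kubectl exec instead of nsenter; a sketch, assuming the image ships netstat or an equivalent such as ss:

      # Equivalent connection listing from inside the pod (sketch).
      kubectl exec -nsa itoa-zookeeper-dp1-5bc799c97f-t7wpz -- netstat -plan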

      • jstack information of one of the pods (see attachment; the other two pods show similar information). I suspect the large number of threads stuck in 'waiting on condition' is where things hang; a capture sketch follows.
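
      A sketch of how such a thread dump can be captured (assumes jstack is present in the image and that the ZooKeeper JVM is PID 1 inside the container, which the /tmp/.java_pid1.tmp socket above suggests):

      # Capture a thread dump from the running pod (sketch).
      kubectl exec -nsa itoa-zookeeper-dp1-5bc799c97f-t7wpz -- jstack -l 1 > zk-dp1-jstack.txt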


      Attachments

      People

            Assignee: Unassigned
            Reporter: jung karl