Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 3.5.5
Fix Version/s: None
Component/s: None
Description
I'm running ZooKeeper from the 3.5.5 Docker image in Kubernetes. The 3 pods are located on three nodes and communicate with each other through Kubernetes ClusterIP services. I did the following: I shut down node 2, left it down for about 3 hours, and then brought it back so it rejoined the cluster. After that, ZooKeeper could not respond to any requests (including from zkCli.sh). I used the ruok and mntr commands to test the servers and found only one anomaly: zk_outstanding_requests is abnormally high and keeps growing (I have a Kafka cluster using this ZooKeeper ensemble).
my cluster:
//my zookeeper pods
[root@node1 ~]# kubectl get po -nsa -o wide | grep zookeeper-dp
itoa-zookeeper-dp1-5bc799c97f-t7wpz   1/1   Running   0   17h   177.177.166.244   node1   <none>   <none>
itoa-zookeeper-dp2-6d89756876-k2ds2   1/1   Running   0   13h   177.177.104.35    node2   <none>   <none>
itoa-zookeeper-dp3-b55ddd6c7-8h5p7    1/1   Running   0   16h   177.177.135.54    node3   <none>   <none>
//my zookeeper cluster svc
[root@node1 ~]# kubectl get svc -nsa -o wide | grep zookeeper-dp
itoa-zookeeper-service1   ClusterIP   10.96.235.75    <none>   2181/TCP,2888/TCP,3888/TCP   44h   node=itoa-zookeeper-dp1
itoa-zookeeper-service2   ClusterIP   10.96.161.122   <none>   2181/TCP,2888/TCP,3888/TCP   44h   node=itoa-zookeeper-dp2
itoa-zookeeper-service3   ClusterIP   10.96.67.1      <none>   2181/TCP,2888/TCP,3888/TCP   44h   node=itoa-zookeeper-dp3
my cluster conf:
[root@node1 ~]# echo conf | nc 177.177.166.244 2181
clientPort=2181
secureClientPort=-1
dataDir=/opt/zookeeper/data/version-2
dataDirSize=135540411
dataLogDir=/opt/zookeeper/data/version-2
dataLogSize=135540411
tickTime=2000
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=1
initLimit=10
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0
membership:
server.1=0.0.0.0:2888:3888:participant
server.2=itoa-zookeeper-service2:2888:3888:participant
server.3=itoa-zookeeper-service3:2888:3888:participant
version=0
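For reference, the zoo.cfg behind this should look roughly like the sketch below. It is reconstructed from the conf output above, not copied from the actual file, so the exact paths are approximate (the reported dataDir has version-2 appended by ZooKeeper); the 0.0.0.0 entry for the local server is the usual workaround for a pod that cannot bind its own Service address.

# zoo.cfg on server 1 (approximate reconstruction, not the actual file)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
maxClientCnxns=60
server.1=0.0.0.0:2888:3888
server.2=itoa-zookeeper-service2:2888:3888
server.3=itoa-zookeeper-service3:2888:3888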
ruok & mntr responses:
//177.177.166.244
[root@node1 ~]# echo ruok | nc 177.177.166.244 2181 && echo mntr | nc 177.177.166.244 2181
imok
zk_version 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
zk_avg_latency 2
zk_max_latency 2
zk_min_latency 2
zk_packets_received 740
zk_packets_sent 105
zk_num_alive_connections 1
zk_outstanding_requests 634
zk_server_state follower
zk_znode_count 8092
zk_watch_count 0
zk_ephemerals_count 5
zk_approximate_data_size 708815
zk_open_file_descriptor_count 100
zk_max_file_descriptor_count 1048576
//177.177.135.54
[root@node1 ~]# echo ruok | nc 177.177.135.54 2181 && echo mntr | nc 177.177.135.54 2181
imok
zk_version 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 727
zk_packets_sent 107
zk_num_alive_connections 1
zk_outstanding_requests 619
zk_server_state leader
zk_znode_count 8092
zk_watch_count 0
zk_ephemerals_count 5
zk_approximate_data_size 708815
zk_open_file_descriptor_count 101
zk_max_file_descriptor_count 1048576
zk_followers 1
zk_synced_followers 1
zk_pending_syncs 0
zk_last_proposal_size 32
zk_max_proposal_size 162
zk_min_proposal_size 32
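To confirm that zk_outstanding_requests keeps growing rather than just being high in a single snapshot, a loop like the one below can poll mntr on all three servers every 10 seconds (the IPs are the pod IPs listed above; nc must be available on the host):

while true; do
  for ip in 177.177.166.244 177.177.104.35 177.177.135.54; do
    # print a timestamp, the server IP and its current outstanding request count
    printf '%s %s ' "$(date '+%H:%M:%S')" "$ip"
    echo mntr | nc "$ip" 2181 | grep zk_outstanding_requests
  done
  sleep 10
done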
zkCli.sh responds with 'Client session timed out':
[zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 0] ls /
2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - WARN  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0
2021-04-29 11:08:22,459 [myid:itoa-zookeeper-service3:2181] - INFO  [main-SendThread(itoa-zookeeper-service3:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server in 10007ms for sessionid 0x0, closing socket connection and attempting reconnect
KeeperErrorCode = ConnectionLoss for /
[zk: itoa-zookeeper-service1:2181,itoa-zookeeper-service2:2181,itoa-zookeeper-service3:2181(CONNECTING) 1]
2021-04-29 11:08:22,743 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service2/10.96.161.122:2181. Will not attempt to authenticate using SASL (unknown error)
2021-04-29 11:08:22,744 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:45456, server: itoa-zookeeper-service2/10.96.161.122:2181
2021-04-29 11:08:22,747 [myid:itoa-zookeeper-service2:2181] - INFO  [main-SendThread(itoa-zookeeper-service2:2181):ClientCnxn$SendThread@1240] - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2021-04-29 11:08:23,761 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@1112] - Opening socket connection to server itoa-zookeeper-service1/10.96.235.75:2181. Will not attempt to authenticate using SASL (unknown error)
2021-04-29 11:08:23,762 [myid:itoa-zookeeper-service1:2181] - INFO  [main-SendThread(itoa-zookeeper-service1:2181):ClientCnxn$SendThread@959] - Socket connection established, initiating session, client: /177.177.166.244:42428, server: itoa-zookeeper-service1/10.96.235.75:2181
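For completeness, the same check can be run against a single server with an explicit session timeout, to rule out the multi-server connect string and the default timeout (both options are supported by zkCli.sh in 3.5.x):

# connect to one service only, with a 30s session timeout, and run a single command
zkCli.sh -timeout 30000 -server itoa-zookeeper-service1:2181 ls /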
I shut down node2 at 2021-04-28 17:12 and brought it back up at 20:40. My pod logs are in the attachments (due to Jira's attachment size limit, I split the log of pod2).
Other information I collected:
- The TCP connection information of one of the pods (no abnormalities in any of the 3 pods):
[root@node1 ~]# nsenter -n -t $(docker inspect -f '{{.State.Pid}}' $(docker ps | grep zookeeper-dp1 | grep -v pause | awk '{print $1}')) netstat -plan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:35883           0.0.0.0:*               LISTEN      109187/java
tcp        0      0 0.0.0.0:3888            0.0.0.0:*               LISTEN      109187/java
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      109187/java
tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN      109187/java
tcp        0      0 177.177.166.244:3888    177.177.104.35:57886    ESTABLISHED 109187/java
tcp        0      0 177.177.166.244:42204   10.96.67.1:2888         ESTABLISHED 109187/java
tcp        0      0 177.177.166.244:3888    177.177.135.54:44462    ESTABLISHED 109187/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node      PID/Program name    Path
unix  2      [ ACC ]     STREAM     LISTENING     2342381971  109187/java         /tmp/.java_pid1.tmp
unix  2      [ ]         STREAM     CONNECTED     2243039212  109187/java
unix  2      [ ]         STREAM     CONNECTED     2243034965  109187/java
- The jstack output of one of the pods (see the attachment; the other two pods show similar information). I suspect the large number of threads stuck in 'waiting on condition' is related to the problem (a sketch of how such a dump can be collected follows below).
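For reference, the dump can be collected roughly as sketched here; this assumes the image ships a JDK with jstack and has pgrep available, and the pod name and the sa namespace are taken from the kubectl output above.

ZK_POD=itoa-zookeeper-dp1-5bc799c97f-t7wpz
# dump all thread stacks of the ZooKeeper server process (QuorumPeerMain) inside the pod
kubectl exec -n sa "$ZK_POD" -- sh -c 'jstack $(pgrep -f QuorumPeerMain)' > zk1-jstack.txt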