[ZOOKEEPER-3940] Zookeeper restart of leader causes all zk nodes to not serve requests - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.6.2
Fix Version/s: None
Component/s: quorum, server
Labels: None
Environment:

dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
sslQuorum=true
quorumListenOnAllIPs=true
portUnification=false
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.quorum.keyStore.password=********
ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.quorum.trustStore.password=********
ssl.quorum.protocol=TLSv1.2
ssl.quorum.enabledProtocols=TLSv1.2
ssl.client.enable=true
secureClientPort=2281
client.portUnification=true
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.keyStore.password=********
ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.trustStore.password=********
ssl.protocol=TLSv1.2
ssl.enabledProtocols=TLSv1.2
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
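For readers less familiar with the `server.N` syntax and the tick-based limits above, the following is a minimal Python sketch (not part of the original report) that splits one `server.N` line into its fields and converts `initLimit` and `syncLimit` from ticks to milliseconds. The names `ServerSpec`, `parse_server_line`, and `limits_ms` are hypothetical helpers introduced for illustration, and the regular expression is simplified: it assumes plain hostnames or IPv4 addresses only.

```python
import re
from typing import NamedTuple, Tuple


class ServerSpec(NamedTuple):
    """Fields of one `server.N=...` line (hypothetical helper type)."""
    server_id: int
    host: str
    quorum_port: int
    election_port: int
    role: str
    client_port: int


# Simplified pattern: server.<id>=<host>:<quorum>:<election>[:role];[<addr>:]<clientPort>
# Assumption: host is a plain hostname or IPv4 address (no bracketed IPv6).
_SERVER_RE = re.compile(
    r"^server\.(\d+)=([^:]+):(\d+):(\d+)"
    r"(?::(participant|observer))?;(?:[^:;]+:)?(\d+)$"
)


def parse_server_line(line: str) -> ServerSpec:
    """Parse one server line; raise ValueError if it does not match."""
    match = _SERVER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized server line: {line!r}")
    sid, host, quorum, election, role, client = match.groups()
    # The role is optional in the syntax; participant is the default.
    return ServerSpec(int(sid), host, int(quorum), int(election),
                      role or "participant", int(client))


def limits_ms(tick_time: int, init_limit: int, sync_limit: int) -> Tuple[int, int]:
    """initLimit and syncLimit are counted in ticks; convert both to ms."""
    return init_limit * tick_time, sync_limit * tick_time


if __name__ == "__main__":
    print(parse_server_line("server.1=zoo1:2888:3888:participant;2181"))
    print(limits_ms(tick_time=2000, init_limit=10, sync_limit=5))
```

With the values in this environment, followers get 10 ticks of 2000 ms (20 seconds) to connect and sync with a leader, and 5 ticks (10 seconds) to stay in sync afterwards.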

Description

We configured a 3-node ZooKeeper cluster running version 3.6.2 in a containerized environment on Docker 1.12.1. The cluster setup corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.

NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the 3.6 branch.

As part of our testing, we restarted each of the ZooKeeper nodes and observed the following behaviour.

Initial state: zoo1, zoo2, and zoo3 healthy (zoo1 is leader)

We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.

1. simulate patching zoo2

restart zoo2
zk_synced_followers 1
zoo1 leader
zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
zoo3 healthy
waited 5 minutes with no change
restart zoo3
zoo1 leader
zk_synced_followers 1
zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
zoo3 healthy
restart zoo2
no changes
restart zoo3
zoo1 leader
zk_synced_followers 2
zoo2 healthy
zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
waited 5 minutes and zoo3 returned to healthy

2. simulate patching zoo3

zoo1 leader
restart zoo3
zk_synced_followers 2
zoo1, zoo2, and zoo3 healthy

3. simulate patching zoo1

zoo1 leader
restart zoo1
zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
waited 5 minutes (Sep 17 14:39 to Sep 17 14:44) to see whether they would recover
restarted the nodes in the order zoo2, zoo3, zoo1 with no change; all remained unhealthy (this step was not captured in the log files)

The third case above is the critical one, since after it we are no longer able to start any of the ZooKeeper nodes.
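The health observations in the steps above (the `zk_synced_followers` count and the "not currently serving requests" message) can be collected with ZooKeeper's four-letter-word commands such as `mntr`. The following Python sketch is an illustration only, not the tooling used in this report. It assumes the plaintext client port 2181 from the `server.N` lines is reachable and that `mntr` is whitelisted (as `4lw.commands.whitelist=*` does here); the helper names `four_letter_word`, `parse_mntr`, and `summarize` are hypothetical.

```python
import socket
from typing import Dict, Optional

# Message returned by a node that has not joined a quorum (quoted in the report).
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter-word command over a plaintext socket and read the reply.

    Assumption: the port accepts unencrypted connections.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> Dict[str, str]:
    """Parse tab-separated `key<TAB>value` lines as printed by `mntr`."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            metrics[key.strip()] = value.strip()
    return metrics


def summarize(text: str) -> Dict[str, Optional[object]]:
    """Classify one node's `mntr` reply as healthy or unhealthy."""
    if NOT_SERVING in text:
        return {"healthy": False, "state": None, "synced_followers": None}
    metrics = parse_mntr(text)
    followers = metrics.get("zk_synced_followers")  # reported by the leader only
    return {
        "healthy": "zk_server_state" in metrics,
        "state": metrics.get("zk_server_state"),
        "synced_followers": int(float(followers)) if followers is not None else None,
    }


if __name__ == "__main__":
    for node in ("zoo1", "zoo2", "zoo3"):
        try:
            print(node, summarize(four_letter_word(node, 2181, "mntr")))
        except OSError as exc:
            print(node, "unreachable:", exc)
```

Running a check like this after each restart would give one status line per node, which makes it easier to line the observed states up with timestamps in the container logs.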

maoling, this issue may be related to https://issues.apache.org/jira/browse/ZOOKEEPER-3920, which corresponds to the first and second cases above and which I am working on with blb93.

Attachments

nossl-zoo.cfg (0.4 kB, 25/Sep/20 16:26, Stan Henderson)
zk-docker-containers.log.zip (808 kB, 21/Sep/20 13:59, Stan Henderson)
zk-docker-containers-nossl.log.zip (41 kB, 25/Sep/20 16:26, Stan Henderson)
zoo.cfg (0.6 kB, 15/Oct/20 23:24, Stan Henderson)
zoo.cfg (1 kB, 21/Sep/20 13:59, Stan Henderson)
zoo1-docker-containers.log (322 kB, 06/Oct/20 14:54, Stan Henderson)
zoo1-docker-containers.log (62 kB, 03/Oct/20 15:44, Stan Henderson)
zoo1-follower.log (95 kB, 15/Oct/20 23:24, Stan Henderson)
zoo2-docker-containers.log (95 kB, 03/Oct/20 16:51, Stan Henderson)
zoo2-leader.log (150 kB, 15/Oct/20 23:24, Stan Henderson)
zoo3-docker-containers.log (90 kB, 03/Oct/20 16:51, Stan Henderson)
zoo3-follower.log (142 kB, 15/Oct/20 23:24, Stan Henderson)

Issue Links

relates to: ZOOKEEPER-3920 Zookeeper clients timeout after leader change due to 0.0.0.0 address when in docker environment (Closed)

People

Assignee: Unassigned

Reporter: Stan Henderson

Votes: 3

Watchers: 11

Dates

Created: 21/Sep/20 14:04

Updated: 27/Aug/21 17:28