[ZOOKEEPER-3871] Zookeeper clients fail on dockerized Zookeeper leader changes - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Works for Me
Affects Version/s: 3.5.5, 3.6.1, 3.5.8
Fix Version/s: None
Component/s: None
Labels: None

Description

In a nutshell, my dockerized Zookeeper installation stops serving clients whenever the cluster leader changes.

The cluster still responds to 4-letter commands, but once I force a leader change, clients time out indefinitely. A workaround is to run follow-up restarts, which usually resolve the issue once the leader returns to its previous state. This undermines the high availability of the cluster.

Example

Assume that a 3-node ZK cluster starts in the following state (State A). All Zookeeper clients work fine in this state.

ZK 1	ZK 2	ZK 3
follower	follower	leader

A restart then occurs and the cluster ends up in this state (State B):

ZK 1	ZK 2	ZK 3
follower	leader	follower

In State B, every client connection attempt fails and times out indefinitely. Follow-up leader restarts may resolve the issue, usually (but not always) because the cluster returns to State A.

Affected versions

I have verified this bug with dockerized Zookeeper in replicated mode on the following versions:

3.5.5
3.5.8
3.6.1

Reproduce

Note: In all the examples below, replace tortoise with your hostname.

Deploy a 3-node Zookeeper cluster (could be 5-node) using the official 3.5.8 image.

docker run -d --name=zkcl01 -p 1493:1493 -p 1494:1494 -p 1495:1495 -h tortoise-zkcl01 -e HOSTNAME=tortoise -e ZOO_PORT=1493 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=0.0.0.0:1495:1494;1493 server.2=tortoise:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e ZOO_MY_ID=1 zookeeper:3.5.8
docker run -d --name=zkcl02 -p 1496:1496 -p 1497:1497 -p 1498:1498 -h tortoise-zkcl02 -e HOSTNAME=tortoise -e ZOO_PORT=1496 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=0.0.0.0:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e ZOO_MY_ID=2 zookeeper:3.5.8
docker run -d --name=zkcl03 -p 1499:1499 -p 1500:1500 -p 1501:1501 -h tortoise-zkcl03 -e HOSTNAME=tortoise -e ZOO_PORT=1499 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=tortoise:1498:1497;1496 server.3=0.0.0.0:1501:1500;1499" -e ZOO_MY_ID=3 zookeeper:3.5.8
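The three ZOO_SERVERS values above follow one pattern: each node lists itself as 0.0.0.0 and its peers by hostname, with ports assigned in blocks of three starting at 1493. As a minimal sketch of that pattern (the function name zoo_servers is hypothetical and not part of the image), the value for a given node could be generated like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ZOO_SERVERS value for node $1 on host $2,
# following the port layout used in the docker run commands of this report
# (client port 1493 + 3*(id-1), then election port, then quorum port).
zoo_servers() {
  local my_id="$1" host="$2" id base addr out=""
  for id in 1 2 3; do
    base=$(( 1493 + 3 * (id - 1) ))
    # A node refers to itself as 0.0.0.0 and to its peers by hostname.
    if [ "$id" -eq "$my_id" ]; then addr="0.0.0.0"; else addr="$host"; fi
    out="${out}${out:+ }server.${id}=${addr}:$(( base + 2 )):$(( base + 1 ));${base}"
  done
  printf '%s\n' "$out"
}
```

For example, zoo_servers 2 tortoise prints the value passed to the second container.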

Monitor the cluster's state with the 4-letter srvr command:

watch -n 1 'for i in 1493 1496 1499; do echo $i; echo srvr | nc tortoise $i ; echo; done'
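The full srvr output is verbose when only the role of each node matters. A minimal sketch that reduces it to one line per port, assuming the output contains a "Mode: ..." line and that the local nc supports the -w timeout option (the function names and the ZK_HOST variable are hypothetical):

```shell
#!/usr/bin/env bash
# Read srvr output on stdin and print the value of its "Mode:" line,
# or "unknown" when no such line is present (e.g. the node did not answer).
zk_mode() {
  awk -F': ' '/^Mode: / { print $2; found = 1 } END { if (!found) print "unknown" }'
}

# Print "<port> <mode>" for each client port given as an argument.
# ZK_HOST is an assumed variable; it defaults to the hostname used in this report.
zk_roles() {
  local host="${ZK_HOST:-tortoise}" port
  for port in "$@"; do
    printf '%s %s\n' "$port" "$(echo srvr | nc -w 2 "$host" "$port" 2>/dev/null | zk_mode)"
  done
}
```

Calling zk_roles 1493 1496 1499 then shows at a glance which node is the leader.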

Verify that you can connect to the cluster successfully using any client (zkCli.sh in this case)

docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
WatchedEvent state:SyncConnected type:None path:null
[zookeeper]

Stop and start the leader node (identified from the srvr output of the monitoring step) to force a leader change.

docker stop zkcl03; sleep 15; docker start zkcl03
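Because the leader differs between runs, the container to stop has to be looked up each time. A rough sketch of automating that lookup, assuming the port-to-container mapping from the docker run commands in this report and an nc that supports -w (all function names and the ZK_HOST variable are hypothetical):

```shell
#!/usr/bin/env bash
# Map a client port to its container name, per the docker run commands in this report.
container_for_port() {
  case "$1" in
    1493) echo zkcl01 ;;
    1496) echo zkcl02 ;;
    1499) echo zkcl03 ;;
    *) return 1 ;;
  esac
}

# Print the client port of the node whose srvr output reports "Mode: leader".
leader_port() {
  local host="${ZK_HOST:-tortoise}" port
  for port in 1493 1496 1499; do
    if echo srvr | nc -w 2 "$host" "$port" 2>/dev/null | grep -q '^Mode: leader'; then
      echo "$port"
      return 0
    fi
  done
  return 1
}

# Stop the current leader, wait 15 seconds, and start it again to force an election.
bounce_leader() {
  local port name
  port="$(leader_port)" || { echo "no leader found" >&2; return 1; }
  name="$(container_for_port "$port")" || return 1
  docker stop "$name" && sleep 15 && docker start "$name"
}
```

Running bounce_leader repeatedly reproduces the stop/sleep/start cycle without checking the srvr output by hand.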

Verify that the client now fails to connect and times out.

docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
closing socket connection and attempting reconnect
KeeperErrorCode = ConnectionLoss for /
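The command above passes all three endpoints at once, so it does not show which node rejects the session. One way to narrow this down is to probe each endpoint separately and classify the result by the two markers seen in the outputs of this report (SyncConnected and ConnectionLoss). This is only a sketch: the function names are hypothetical, and it assumes the zkcl01 container is running and that the coreutils timeout command is available on the host.

```shell
#!/usr/bin/env bash
# Split a comma-separated connection string into one endpoint per line.
split_endpoints() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Classify zkCli.sh output read on stdin using the markers shown in this report.
classify() {
  local out
  out="$(cat)"
  case "$out" in
    *ConnectionLoss*) echo failed ;;
    *SyncConnected*)  echo ok ;;
    *)                echo unknown ;;
  esac
}

# Run `ls /` against each endpoint on its own and print "<endpoint> <result>".
probe_endpoints() {
  local endpoint
  for endpoint in $(split_endpoints "$1"); do
    printf '%s %s\n' "$endpoint" \
      "$(timeout 30 docker exec zkcl01 bin/zkCli.sh -server "$endpoint" ls / 2>&1 | classify)"
  done
}
```

For example, probe_endpoints tortoise:1493,tortoise:1496,tortoise:1499 prints one result line per node.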

Finally, stop/sleep/start the leader a few more times and verify that the client succeeds again, usually once the leader goes back to the initial state.

This must be a bug unless there is a misconfiguration that I am missing.

People

Assignee: Unassigned

Reporter: ko christ

Votes: 0

Watchers: 1

Dates

Created: 25/Jun/20 12:47

Updated: 26/Jun/20 12:20

Resolved: 26/Jun/20 12:20