ZooKeeper / ZOOKEEPER-3871

Zookeeper clients fail on dockerized Zookeeper leader changes



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Works for Me
    • Affects Version/s: 3.5.5, 3.6.1, 3.5.8
    • Fix Version/s: None
    • Component/s: None
    • Labels:



      In a nutshell, my dockerized ZooKeeper installation stops working after a cluster leader change.

      The cluster still responds to 4-letter commands, but when I force a leader change, clients time out indefinitely. A workaround is to perform follow-up restarts, which usually resolve the issue once the leader returns to its previous node. This affects the high availability of the cluster.


      For example, assume a 3-node ZK cluster with the following initial state (State A). All ZooKeeper clients work fine in this state.

      ZK 1       ZK 2       ZK 3
      follower   follower   leader


      Then a restart occurs and ZooKeeper ends up in this state (State B):

      ZK 1       ZK 2       ZK 3
      follower   leader     follower

      In State B, all client connection attempts fail and time out indefinitely. Follow-up leader restarts may resolve the issue, usually (but not always) because the cluster returns to the previous State A.
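      The per-node roles shown in the tables above come from the Mode: line of the srvr response. A minimal sketch of extracting it, parsing a captured (abridged) sample so it runs standalone; against a live node the response would come from echo srvr | nc tortoise 1493:

```shell
#!/bin/sh
# Extract the "Mode:" field from a srvr response.
# The sample below is an abridged, canned response; against a live node
# you would pipe in the real output of:  echo srvr | nc tortoise 1493
sample='Zookeeper version: 3.5.8
Latency min/avg/max: 0/0/0
Mode: follower
Node count: 5'

mode=$(printf '%s\n' "$sample" | awk -F': ' '/^Mode:/ {print $2}')
echo "$mode"    # prints: follower
```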

      Affected versions

      I have verified this bug with dockerized ZooKeeper in replicated mode on:

      • 3.5.5
      • 3.5.8
      • 3.6.1


      Note: In all the examples below, replace tortoise with your hostname.

      Deploy a 3-node ZooKeeper cluster (a 5-node cluster works as well) using the official 3.5.8 image.

      docker run -d --name=zkcl01 -p 1493:1493 -p 1494:1494 -p 1495:1495 -h tortoise-zkcl01 \
        -e HOSTNAME=tortoise -e ZOO_PORT=1493 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" \
        -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False \
        -e ZOO_SERVERS="server.1=;1493 server.2=tortoise:1498:1497;1496 server.3=tortoise:1501:1500;1499" \
        -e ZOO_MY_ID=1 zookeeper:3.5.8
      docker run -d --name=zkcl02 -p 1496:1496 -p 1497:1497 -p 1498:1498 -h tortoise-zkcl02 \
        -e HOSTNAME=tortoise -e ZOO_PORT=1496 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" \
        -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False \
        -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=;1496 server.3=tortoise:1501:1500;1499" \
        -e ZOO_MY_ID=2 zookeeper:3.5.8
      docker run -d --name=zkcl03 -p 1499:1499 -p 1500:1500 -p 1501:1501 -h tortoise-zkcl03 \
        -e HOSTNAME=tortoise -e ZOO_PORT=1499 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" \
        -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False \
        -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=tortoise:1498:1497;1496 server.3=;1499" \
        -e ZOO_MY_ID=3 zookeeper:3.5.8
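      As an aside, the three ZOO_SERVERS strings differ only in the entry for the local node; a sketch that generates them from the port layout rather than typing them by hand (the tortoise hostname and the 1493-1501 port block are taken from the commands above):

```shell
#!/bin/sh
# Generate the ZOO_SERVERS string for each node, mirroring the port
# layout above: node i uses client port 1493+3(i-1), election port
# client+1, quorum port client+2; the local node's entry keeps only
# the client port, as in the docker run commands.
host=tortoise
for me in 1 2 3; do
  servers=""
  for i in 1 2 3; do
    client=$((1493 + 3 * (i - 1)))
    if [ "$i" -eq "$me" ]; then
      entry="server.$i=;$client"
    else
      entry="server.$i=$host:$((client + 2)):$((client + 1));$client"
    fi
    servers="$servers${servers:+ }$entry"
  done
  echo "ZOO_MY_ID=$me ZOO_SERVERS=\"$servers\""
done
```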


      Monitor the cluster's state with the 4-letter srvr command:

      watch -n 1 'for i in 1493 1496 1499; do echo $i; echo srvr | nc tortoise $i ; echo; done'


      Verify that you can connect to the cluster successfully using any client (zkCli.sh in this case):

      docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
      WatchedEvent state:SyncConnected type:None path:null


      Stop and start the leader node (identified from the srvr output in the previous step) to force a leader change.

      docker stop zkcl03; sleep 15; docker start zkcl03
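      A sketch of automating this step: find which client port reports leader, then map it to its container (canned Mode values stand in for live srvr responses so the snippet runs standalone; the port-to-container mapping matches the docker run commands above):

```shell
#!/bin/sh
# Find which client port reports Mode: leader, then map it to a container.
# The port:mode pairs are canned; live values would come from
#   echo srvr | nc tortoise $port
leader_port=""
for pair in 1493:follower 1496:leader 1499:follower; do
  port=${pair%%:*}
  mode=${pair#*:}
  if [ "$mode" = "leader" ]; then
    leader_port=$port
  fi
done

case $leader_port in
  1493) container=zkcl01 ;;
  1496) container=zkcl02 ;;
  1499) container=zkcl03 ;;
esac
echo "leader: $container (client port $leader_port)"
# Forcing the leader change would then be:
#   docker stop "$container"; sleep 15; docker start "$container"
```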


      Verify that the client now fails to connect and times out.

      docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
      closing socket connection and attempting reconnect
      KeeperErrorCode = ConnectionLoss for /
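      When repeating this check, it helps to bound each attempt instead of letting the client hang; a sketch wrapping the zkCli call with coreutils timeout (it assumes the zkcl01 container and ports from the steps above, and simply reports a failure on a machine without the cluster):

```shell
#!/bin/sh
# Bound the connection attempt to 10 seconds and report the outcome.
# Requires the containers from the reproduction steps; without them
# the command fails fast and we take the else branch.
if timeout 10 docker exec zkcl01 bin/zkCli.sh \
     -server tortoise:1493,tortoise:1496,tortoise:1499 ls / >/dev/null 2>&1; then
  result="connected"
else
  result="failed or timed out"
fi
echo "client $result"
```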


      Finally, stop/sleep/start the leader a few more times and verify that the client usually succeeds only when the leader returns to the initial State A.


      This must be a bug unless there is a misconfiguration that I am missing.




            • Assignee:
              kochrist (ko christ)
            • Votes: 0
            • Watchers: 1


              • Created: