[ZOOKEEPER-3920] Zookeeper clients timeout after leader change due to 0.0.0.0 address when in docker environment - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.6.1
Fix Version/s: 3.6.2
Component/s: quorum, server
Labels: None

Description

[Sorry, I believe this is a duplicate of https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially https://issues.apache.org/jira/browse/ZOOKEEPER-3466, but I am not able to attach files there for some reason, so I am creating a new issue, which hopefully allows me to.]

We are encountering an issue where failing over from the leader leaves ZooKeeper clients unable to connect: they time out waiting for a response from the server. We are attempting to upgrade some existing ZooKeeper clusters from 3.4.14 to 3.6.1 (possibly irrelevant, but stated in case it helps pinpoint the issue), and the upgrade is effectively blocked by this issue. We perform a rolling upgrade (followers first, leader last), and by all indicators it succeeds. But we end up in the state described here: if the leader changes (due to either a restart or a stop), the cluster does not seem able to start new sessions.

I've gathered some TRACE logs from our servers and will attach them in the hope that they help figure this out.

Attached is zk_repro.zip, which contains the following:

zoo.cfg used in one of the instances (they are all the same except that each server's own entry uses the IP 0.0.0.0)
zoo.cfg.dynamic.next (I don't think this is used anywhere, but ZooKeeper writes it at some point; judging by its value, I think when the first 3.6.1 container becomes leader. The file is present in all containers and is identical in all of them)
s{1,2,3}_zk.log - logs from each of the 3 servers. The estimated start of the repro is marked by "// REPRO START" text and whitespace in the logs
repro_steps.txt - rough steps executed that result in the server logs attached
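
To make the 0.0.0.0 detail concrete, here is a minimal sketch that scans zoo.cfg-style text for server.N entries whose host is the wildcard address. The sample config is hypothetical: the hostnames zoo2 and zoo3, the ports, and the other settings are placeholders and are not taken from the attached zoo.cfg. The regular expression only handles the simple host:port:port form.

```python
import re

# Matches the simple form: server.<id>=<host>:<quorum port>:<election port>
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def wildcard_servers(cfg_text):
    """Return the ids of server.N entries whose host is 0.0.0.0."""
    ids = []
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match and match.group(2) == "0.0.0.0":
            ids.append(int(match.group(1)))
    return ids


# Hypothetical config as it might look on server 1 (assumed values,
# not copied from the attachment).
SAMPLE_CFG = """\
tickTime=2000
dataDir=/data
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
"""

print(wildcard_servers(SAMPLE_CFG))  # prints [1]
```

Running the same check against each container's zoo.cfg should flag only that container's own server id if the setup matches the description above.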

I'll also summarize the repro here:

Initially there appears to be a healthy 3-node ensemble, all running 3.6.1. Server ids are 1, 2, 3 and 3 is the leader. Dynamic config/reconfiguration is disabled.
Invoke srvr on each node (to verify the setup and also create a bookmark in the logs)
Do a zkCli get of /zookeeper/quota which succeeds
Restart the leader (same image/config); server 2 now becomes leader and 3 comes back as a follower
Try to perform the same zkCli get, which times out (this get is done within the container)
Try to perform the same zkCli get from another machine; this also times out
Invoke srvr on each node again (to verify that 2 is now the leader, and as a bookmark)
Restart server 2 (3 becomes leader, 2 a follower)
Do a zkCli get of /zookeeper/quota which succeeds
Invoke srvr on each node again (to verify that 3 is leader)
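
The srvr checks in the steps above can be scripted. The following is a rough sketch, not the procedure actually used for the repro: it sends the srvr four-letter command over a plain TCP socket and extracts the role from the "Mode:" line of the reply. The host names in the usage note and the client port 2181 are assumptions.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a four-letter command and return the server's full reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line (e.g. 'leader'), or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def report_roles(hosts, port=2181):
    """Map each host to its reported mode, or to the error encountered."""
    roles = {}
    for host in hosts:
        try:
            roles[host] = parse_mode(four_letter_word(host, port))
        except OSError as exc:
            roles[host] = "error: {}".format(exc)
    return roles
```

For example, `report_roles(["zoo1", "zoo2", "zoo3"])` (hypothetical host names) could be called before and after each restart to confirm which server is the leader. Note that srvr can report a leader and followers even while new client sessions are timing out, which is the situation described in this issue.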

I tried to keep the other ZK traffic to a minimum, but there are likely some periodic mntr requests mixed in from our metrics scraper.

Attachments

stack.yml (27/Aug/20 04:18, 2 kB, Andre Price)
zk_repro.zip (26/Aug/20 17:14, 97 kB, Andre Price)

Issue Links

is fixed by

ZOOKEEPER-3829 Zookeeper refuses request after node expansion (Closed)

is related to

ZOOKEEPER-3940 Zookeeper restart of leader causes all zk nodes to not serve requests (Open)

ZOOKEEPER-3466 ZK cluster converges, but does not properly handle client connections (new in 3.5.5) (Open)

ZOOKEEPER-3828 zookeeper clients gets connection timeout when the leader node is restarted (Open)

ZOOKEEPER-3829 Zookeeper refuses request after node expansion (Closed)

People

Assignee: Unassigned

Reporter: Andre Price

Votes: 2

Watchers: 8

Dates

Created: 26/Aug/20 17:15

Updated: 25/Sep/20 05:58

Resolved: 10/Sep/20 03:11