Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
3.4.5
Description
I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. When I shut down 2, 1 and 3 keep going back to leader election. Here is what seems to be happening.
- Both 1 and 3 elect 3 as the leader.
- 1 receives votes from 3 and itself, and starts trying to connect to 3 as a follower.
- 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't timeout for 5 seconds: https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
- By the time 3 receives votes, 1 has given up trying to connect to 3: https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
I'm using 3.4.5, but it looks like this part of the code hasn't changed for a while, so I'm guessing later versions have the same issue.
Attachments
Issue Links
- is duplicated by
-
ZOOKEEPER-3725 Zookeeper fails to establish quorum with 2 servers using 3.5.6
- Resolved
- relates to
-
ZOOKEEPER-2080 Fix deadlock in dynamic reconfiguration
- Closed
-
ZOOKEEPER-3756 Members failing to rejoin quorum
- Closed
- requires
-
ZOOKEEPER-900 FLE implementation should be improved to use non-blocking sockets
- Open
- links to