[ZOOKEEPER-3756] Members failing to rejoin quorum - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.6, 3.5.7
Fix Version/s: 3.6.1, 3.5.8
Component/s: leaderElection
Labels:
- pull-request-available

Description

I'm not sure if this is the right place to ask; please close this if it's not.

Since upgrading to 3.5, I have been seeing some behavior that I can't explain:

I have a 5-member quorum in which server 3 is the leader, and each server has this in its configuration:

server.1=100.71.255.254:2888:3888:participant;2181
server.2=100.71.255.253:2888:3888:participant;2181
server.3=100.71.255.252:2888:3888:participant;2181
server.4=100.71.255.251:2888:3888:participant;2181
server.5=100.71.255.250:2888:3888:participant;2181

If server 1 or 2 is restarted, it fails to rejoin the quorum, with this in its logs:

2020-03-11 20:23:35,720 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - LOOKING
2020-03-11 20:23:35,721 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id =  2, proposed zxid=0x1b8005f4bba
2020-03-11 20:23:35,733 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
2020-03-11 20:23:35,734 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140
2020-03-11 20:23:35,735 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (4, 2)
2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (5, 2)
2020-03-11 20:23:35,740 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-03-11 20:23:35,742 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
2020-03-11 20:23:35,744 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  id 3 my id = 2
2020-03-11 20:23:35,745 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker
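
For readers following the log above: the "Have smaller server identifier" lines reflect the tie-break rule for election connections, where a connection is kept only if the server that opened it has the larger id. The sketch below models just that rule for distinct ids. It is not ZooKeeper's code; the function names are hypothetical and only the log message text is taken from the excerpt above.

```python
def keeps_outbound_connection(my_id: int, remote_id: int) -> bool:
    """Model of the tie-break rule (assumes my_id != remote_id).

    The server that opened the connection keeps it only when its own id
    is larger than the remote id. Otherwise it closes the socket and
    waits for the remote server to connect back to it.
    """
    return my_id > remote_id


def describe_outbound(my_id: int, remote_id: int) -> str:
    """Return a message shaped like the log line in the excerpt above."""
    if keeps_outbound_connection(my_id, remote_id):
        return f"server {my_id} keeps its connection to server {remote_id}"
    return (
        "Have smaller server identifier, so dropping the connection: "
        f"({remote_id}, {my_id})"
    )


def peers_that_must_connect_back(my_id: int, all_ids: list[int]) -> list[int]:
    """Peers whose ids are larger, so they have to open the connection."""
    return [sid for sid in all_ids if sid != my_id
            and not keeps_outbound_connection(my_id, sid)]


if __name__ == "__main__":
    # Server 2 in the 5-member quorum from this report.
    for peer in peers_that_must_connect_back(2, [1, 2, 3, 4, 5]):
        print(describe_outbound(2, peer))
```

Under this model, a restarted server 2 depends on servers 3, 4, and 5 connecting back to its election port (3888 in the configuration above), which matches the three dropped pairs in the log.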

The only way I have found to get them to rejoin the quorum is to restart the leader.

However, if I remove servers 4 and 5 from the configuration of server 1 or 2 (so only servers 1, 2, and 3 remain in the configuration file), then it can rejoin the quorum fine. Is this expected, or am I doing something wrong? Any help or explanation would be greatly appreciated. Thank you.
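
To make the two configurations easier to compare, here is a minimal sketch that parses server lines of the exact shape quoted in this report (id, host, quorum port, election port, optional role, client port) and reports the majority size for the listed members. The parser is an assumption-laden illustration: ServerSpec, parse_server_line, and majority_size are hypothetical names, and other forms of the server line (for example one with a client address before the client port) are deliberately not handled.

```python
import re
from typing import NamedTuple


class ServerSpec(NamedTuple):
    sid: int
    host: str
    quorum_port: int
    election_port: int
    role: str
    client_port: int


# Matches only the shape used in this report:
#   server.<id>=<host>:<quorum port>:<election port>[:<role>];<client port>
_SERVER_LINE = re.compile(
    r"^server\.(?P<sid>\d+)=(?P<host>[^:;]+):(?P<qp>\d+):(?P<ep>\d+)"
    r"(?::(?P<role>participant|observer))?;(?P<cp>\d+)$"
)


def parse_server_line(line: str) -> ServerSpec:
    """Parse one server.N line; raise ValueError on any other shape."""
    m = _SERVER_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized server line: {line!r}")
    return ServerSpec(
        sid=int(m["sid"]),
        host=m["host"],
        quorum_port=int(m["qp"]),
        election_port=int(m["ep"]),
        role=m["role"] or "participant",  # role defaults to participant
        client_port=int(m["cp"]),
    )


def majority_size(voter_count: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voter_count // 2 + 1


if __name__ == "__main__":
    config = """\
server.1=100.71.255.254:2888:3888:participant;2181
server.2=100.71.255.253:2888:3888:participant;2181
server.3=100.71.255.252:2888:3888:participant;2181
server.4=100.71.255.251:2888:3888:participant;2181
server.5=100.71.255.250:2888:3888:participant;2181
"""
    specs = [parse_server_line(l) for l in config.splitlines() if l.strip()]
    voters = [s for s in specs if s.role == "participant"]
    print([s.sid for s in voters], "majority:", majority_size(len(voters)))
```

With the full file this lists five voters and a majority of three; with servers 4 and 5 removed it lists three voters and a majority of two. The sketch only describes the two membership views and does not explain why the trimmed configuration lets the restarted server rejoin.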

Attachments

- configmap.yaml, 13/Mar/20 18:15, 2 kB, Dai Shi
- docker-entrypoint.sh, 13/Mar/20 17:54, 1 kB, Dai Shi
- Dockerfile, 13/Mar/20 17:54, 2 kB, Dai Shi
- jmx.yaml, 13/Mar/20 17:54, 0.7 kB, Dai Shi
- zoo-0.log, 13/Mar/20 19:23, 211 kB, Dai Shi
- zoo-1.log, 13/Mar/20 19:23, 121 kB, Dai Shi
- zoo-2.log, 13/Mar/20 19:23, 39 kB, Dai Shi
- zookeeper.yaml, 13/Mar/20 18:15, 2 kB, Dai Shi
- zoo-service.yaml, 13/Mar/20 18:27, 3 kB, Dai Shi

Issue Links

- is related to: ZOOKEEPER-2164 fast leader election keeps failing (Closed)
- relates to: ZOOKEEPER-3838 Async handling of quorum connection requests, including SSL handshakes (Open)
- links to: GitHub Pull Request #1289
- links to: GitHub Pull Request #1293

People

Assignee: Mate Szalay-Beko

Reporter: Dai Shi

Votes: 0

Watchers: 6

Dates

Created: 11/Mar/20 20:43

Updated: 20/May/20 07:06

Resolved: 23/Mar/20 15:20

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3.5h