ZooKeeper / ZOOKEEPER-3756

Members failing to rejoin quorum

Details

    Description

      I'm not sure whether this is the right place to ask; please close this if it's not.

      I am seeing some behavior that I can't explain since upgrading to 3.5:

      In a 5-member quorum where server 3 is the leader, each server has this in its configuration:

      server.1=100.71.255.254:2888:3888:participant;2181
      server.2=100.71.255.253:2888:3888:participant;2181
      server.3=100.71.255.252:2888:3888:participant;2181
      server.4=100.71.255.251:2888:3888:participant;2181
      server.5=100.71.255.250:2888:3888:participant;2181
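For readers unfamiliar with the 3.5-style `server.N` syntax, the lines above can be broken into their fields with a short script. This is only a sketch: `ServerSpec` and `parse_server_line` are hypothetical names, the regular expression handles IPv4 addresses and hostnames only (not IPv6 or multiple addresses), and it assumes the role defaults to `participant` when omitted.

```python
import re
from typing import NamedTuple, Optional


class ServerSpec(NamedTuple):
    """Fields of one server.N line (hypothetical helper type)."""
    sid: int
    host: str
    quorum_port: int
    election_port: int
    role: str
    client_port: Optional[int]


# server.<sid>=<host>:<quorumPort>:<electionPort>[:role][;[clientHost:]clientPort]
_SERVER_LINE = re.compile(
    r"^server\.(?P<sid>\d+)="
    r"(?P<host>[^:;]+):(?P<quorum>\d+):(?P<election>\d+)"
    r"(?::(?P<role>participant|observer))?"
    r"(?:;(?:[^:;]+:)?(?P<client>\d+))?$"
)


def parse_server_line(line: str) -> ServerSpec:
    """Parse a single server.N line; raise ValueError if it does not match."""
    m = _SERVER_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a recognised server line: {line!r}")
    return ServerSpec(
        sid=int(m["sid"]),
        host=m["host"],
        quorum_port=int(m["quorum"]),
        election_port=int(m["election"]),
        # Assumption: an omitted role means "participant".
        role=m["role"] or "participant",
        client_port=int(m["client"]) if m["client"] else None,
    )


if __name__ == "__main__":
    print(parse_server_line("server.2=100.71.255.253:2888:3888:participant;2181"))
```

In the configuration above, port 2888 is the quorum port, 3888 is the leader-election port, and 2181 after the semicolon is the client port.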

      If server 1 or 2 is restarted, it fails to rejoin the quorum and logs the following:

      2020-03-11 20:23:35,720 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - LOOKING
      2020-03-11 20:23:35,721 [myid:2] - INFO  [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id =  2, proposed zxid=0x1b8005f4bba
      2020-03-11 20:23:35,733 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
      2020-03-11 20:23:35,734 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140
      2020-03-11 20:23:35,735 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (4, 2)
      2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (5, 2)
      2020-03-11 20:23:35,740 [myid:2] - INFO  [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
      2020-03-11 20:23:35,740 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config version)
      2020-03-11 20:23:35,742 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for message on queue
      java.lang.InterruptedException
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
              at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
      2020-03-11 20:23:35,744 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  id 3 my id = 2
      2020-03-11 20:23:35,745 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker
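The "Have smaller server identifier, so dropping the connection: (3, 2)" lines appear to reflect the election-connection rule in `QuorumCnxManager`: when two peers connect on the election port, only the connection opened by the peer with the larger server id is kept, so a server that dials a higher-id peer closes its own socket and waits for that peer to connect back. The snippet below is a simplified model of that rule, not ZooKeeper's actual code; both function names are hypothetical.

```python
from typing import Iterable, List


def keeps_outbound_connection(my_sid: int, remote_sid: int) -> bool:
    """True if the dialing server keeps the socket it just opened.

    Simplified model: the outbound connection survives only when the
    dialing server has the larger server id.
    """
    return my_sid > remote_sid


def peers_that_must_dial_back(my_sid: int, all_sids: Iterable[int]) -> List[int]:
    """Peers whose inbound connection this server has to wait for."""
    return sorted(sid for sid in all_sids if sid > my_sid)


if __name__ == "__main__":
    my_sid = 2
    for remote_sid in peers_that_must_dial_back(my_sid, [1, 2, 3, 4, 5]):
        # Mirrors the (remote sid, my sid) pair printed in the log message.
        print(f"dropping outbound connection: ({remote_sid}, {my_sid})")
```

For server 2 this yields peers 3, 4, and 5, which matches the three "dropping the connection" lines in the log. Under this model, server 2 can only exchange election messages with those peers once they open a connection to its port 3888.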

      The only way I have found to get them to rejoin the quorum is to restart the leader.

      However, if I remove servers 4 and 5 from the configuration of server 1 or 2 (so that only servers 1, 2, and 3 remain in the configuration file), it can rejoin the quorum without problems. Is this expected, or am I doing something wrong? Any help or explanation would be greatly appreciated. Thank you.

      Attachments

        1. docker-entrypoint.sh (1 kB, Dai Shi)
        2. Dockerfile (2 kB, Dai Shi)
        3. jmx.yaml (0.7 kB, Dai Shi)
        4. zookeeper.yaml (2 kB, Dai Shi)
        5. configmap.yaml (2 kB, Dai Shi)
        6. zoo-service.yaml (3 kB, Dai Shi)
        7. zoo-2.log (39 kB, Dai Shi)
        8. zoo-1.log (121 kB, Dai Shi)
        9. zoo-0.log (211 kB, Dai Shi)

            People

              symat Mate Szalay-Beko
              dshi Dai Shi
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h