ZooKeeper / ZOOKEEPER-4724

follower can't connect to the right leader and quorum failed to form

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.6.4
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When entering the "following - discovery" state, a follower connects to the leader node so that a quorum can form. Recently, however, a user hit an issue where the follower could not connect to the correct leader and the quorum failed to form. The log shows the follower trying to connect to itself (0.0.0.0:2888) instead of the leader. After 5 retries a new election starts and the same sequence repeats: the node becomes a follower, tries to connect to itself, fails, and triggers yet another election.

       

      The log looks like this:

      2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS (org.apache.zookeeper.server.quorum.Learner) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
      2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
      2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init limit=10000, connecting to /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
      java.net.ConnectException: Connection refused
          at java.base/sun.nio.ch.Net.pollConnect(Native Method)
          at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
          at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
          at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
          at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
          at java.base/java.net.Socket.connect(Socket.java:633)
          at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
          at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
          at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
          at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366)
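
When triaging logs like the excerpt above, it can help to tally which address each follower connection attempt targeted. The following is a minimal sketch, not part of ZooKeeper: the regular expression is an assumption based only on the WARN line format shown in this report, and `connect_targets` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Assumption: the WARN line format matches the excerpt in this report,
# i.e. "... tries=<n>, remaining init limit=<ms>, connecting to /<host>:<port> ...".
CONNECT_RE = re.compile(r"tries=\d+, remaining init limit=\d+, connecting to /(\S+)")


def connect_targets(log_lines):
    """Count follower connection attempts per target address."""
    counts = Counter()
    for line in log_lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    "2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery "
    "(org.apache.zookeeper.server.quorum.QuorumPeer) "
    "[QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]",
    "2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, "
    "remaining init limit=10000, connecting to /0.0.0.0:2888 "
    "(org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]",
]

if __name__ == "__main__":
    for target, attempts in connect_targets(SAMPLE).items():
        print(f"{target}: {attempts} attempt(s)")
```

Running the same function over the full attached logs would show whether any attempt ever targeted the other node's hostname, or whether every attempt went to the wildcard address.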

       

      One thing I noticed is that this issue started after "Restarting leader election" occurred on the follower node. I am not sure whether the two are related.

       

      I suspected a race condition between "restarting leader election" (which resets the node's vote to itself) and the vote update. However, as mentioned above, the issue keeps recurring after each subsequent round of leader election.

       

      The configuration and setup:

      1. Two ZooKeeper nodes.
      2. On each ZooKeeper node, we set the node's own address to 0.0.0.0 to work around the slow DNS issue in k8s (see ZOOKEEPER-4708). That is:

      For node 1, we have:

      server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
      server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181

      For node 2, we have:

      server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
      server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181
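
To make the setup above easier to inspect, here is a minimal sketch that parses the `server.N` lines and reports which entries use a wildcard address. It is an illustration only, not ZooKeeper code: the helper names (`parse_server_line`, `is_wildcard`, `wildcard_entries`) are hypothetical, and the parser assumes IPv4 addresses or hostnames in the `host:quorumPort:electionPort[:role][;clientAddr]` layout used in this report.

```python
import ipaddress


def parse_server_line(line):
    """Split 'server.<id>=<host>:<quorum>:<election>[:<role>][;<client>]'.

    Assumption: the host is an IPv4 address or a hostname (no bracketed IPv6).
    """
    key, value = line.strip().split("=", 1)
    server_id = int(key.split(".", 1)[1])
    quorum_part = value.split(";", 1)[0]  # drop the client address, if any
    host, quorum_port, election_port = quorum_part.split(":")[:3]
    return server_id, host, int(quorum_port), int(election_port)


def is_wildcard(host):
    """Return True for an unspecified address such as 0.0.0.0."""
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        return False  # not an IP literal, so treat it as a hostname


def wildcard_entries(config_lines):
    """Map server id -> 'host:quorumPort' for entries using a wildcard host."""
    found = {}
    for line in config_lines:
        if line.strip().startswith("server."):
            server_id, host, quorum_port, _ = parse_server_line(line)
            if is_wildcard(host):
                found[server_id] = f"{host}:{quorum_port}"
    return found


# Configuration lines copied from this report.
NODE1 = [
    "server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181",
    "server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181",
]
NODE2 = [
    "server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181",
    "server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181",
]

if __name__ == "__main__":
    print("node 1:", wildcard_entries(NODE1))
    print("node 2:", wildcard_entries(NODE2))
```

On node 1 the only wildcard entry is its own id, with address 0.0.0.0:2888, which is the address in the WARN line. That is consistent with the follower looking up the leader's address under its own server id, although this sketch cannot confirm that this is what happens inside ZooKeeper.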

      Logs:

      zookeeper-custom-image-rep1.txt
      zookeeper-custom-image-rep2.txt

            People

              Assignee: Unassigned
              Reporter: Luke Chen (showuon)
              Votes: 0
              Watchers: 2
