Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-1515

Long reconnect timeout if leader failed.

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.3.5
    • None
    • leaderElection, quorum, server
    • Gentoo linux, but every environment is affected.

    Description

      In zookeeper 3.3.5 in file src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325 you may see Thread.sleep(1000);

      This is always happens after leader failure or restart. Zookeeper reelects new leader and all followers try to connect to it. But first attempt always fails because of "Connection refused":

      2012-07-23 18:55:48,159 - WARN [QuorumPeer:/0.0.0.0:2181:Learner@229] - Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
      java.net.ConnectException: Connection refused
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
      at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
      at java.net.Socket.connect(Socket.java:529)
      at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
      at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)

      I propose to change this line to the next code:

      Learner.java
      if (tries > 0) {
          Thread.sleep(self.tickTime);
      }
      

      This way first reconnect attempt will be done immediately, other will wait for tick time (this is good semantic change, I suppose).

      The result of this change - leader reelection time lowered from >1500ms to 300-400ms with 50ms tick time. This is pretty important for our production environment and will not break any existing installations.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            bobrik Ivan Babrou

            Dates

              Created:
              Updated:

              Slack

                Issue deployment