Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2386

Cannot achieve quorum when middle server (in a q of 3) is unreacable

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Recently, we've observed a curious case where a quorum was not reached for days in a cluster of 3 nodes (zk0, zk1, zk2) and the middle node zk1 is unreachable from network.

      The leader election happens, and both zk0 and zk2 starts the vote. Then each server sends notifications to every other server including itself. The problem is that, zk1 vm is unavailable, so when we are trying to open up a socket to connect to that server with socket timeout of 5 seconds, it delays the notification processing of the vote sent from zk2 to zk2 (itself). The vote eventually comes after 5 sec, but by that time, the follower (zk0) already converted to the follower state. On the follower state, the follower try to connect to leader 5 times with 1 second timeout (5 sec in total). Since the leader does not start its peer port for 5 seconds after the follower starts, the follower always times out connecting to the leader. This cycle is repeating for hours / days even after restarting the servers several times.

      I believe this is related to the default timeouts (5 sec socket timeout) and follower to leader connection timeout (5 tries with 1 second timeout). Only after setting the zookeeper.cnxTimeout to 1 second, the quorum was operating.

      More logs coming shortly.

      zoo.cfg:

      server.3=zk2-hostname:2889:3889
      server.2=zk1-hostname:2889:3889
      server.1=zk0-hostname:2889:3889
      

      Attachments

        1. zklogs.tar.gz
          23 kB
          Enis Soztutar

        Activity

          People

            Unassigned Unassigned
            enis Enis Soztutar
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: