Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-3902

ZooKeeper cluster did not perform leader reelection when leader host crashed

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.4.13
    • None
    • quorum
    • None

    Description

      I saw this issue in one of the zookeeper cluster where LEADER host crashed due to h/w issue and all the follower hosts immediately closed client connection with leader host but zk-cluster did not trigger leader reelection until we manually restarted few zookeeper host after some time.

      Snapshot size: 200MB
      ZooKeeper version: 3.4.13

      Below logs are from Leader and one of the follower host, when Leader host was crashed.

      1. Leader host: zookeeper-4.us-west.com crashed at 16:21:09 and no application logs.
      2. Follower host: zookeeper-2.us-west.com immediately closed client-connection with zookeeper-4.us-west.com at 16:21:09
      3. None of the Follower hosts triggered LEADER reelection and don't see any such logs
      4. All the zk-clients started getting zookeeper session timeouts
      5. after 27 minutes at 16:48:30, Follower host found out the issue in quorum and logged
      "Have smaller server identifier, so dropping the connection"
      6. But it still did not trigger leader reelection until we manually restarted zk-process in few zk hosts after 5 mins
      7. and finally after restarting process in few zk hosts manually: zk cluster did leader reelection and created the quorum

      Logs:

      Leader zookeeper: zookeeper-4.us-west.com

      Leader zookeeper: zookeeper-4.us-west.com
      Host crashed at : 16:21:10,095 and last log was `16:21:10,095`. and then it came back after an hour: `17:25:27` and joined as FOLLOWER.
      
      16:21:09,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.1:45186
      16:21:09,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing mntr command from /0.0.0.1:45186
      16:21:10,095 - INFO [Thread-194783:NIOServerCnxn@1056] - Closed socket connection for client /0.0.0.1:45186 (no session established for client)
      17:25:27,262 - INFO [main:QuorumPeerConfig@136] - Reading configuration from: 
      

       

      Follower zookeeper: zookeeper-2.us-west.com

      16:21:09,743 - INFO [Thread-78288:NIOServerCnxn@1056] - Closed socket connection for client /4.4.4.4:52216 (zookeeper-4.us-west.com-IP) (no session established for client)
      :
      # ZK host didn't trigger the LEADER-REELECTION even despite zookeeper-4.us-west.com crashed and lost connection with that host at : `16:21:09,743`
      # Client was keep connecting and keep disconnecting due to zk session timeout
      16:48:29,961 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.2:43938
      16:48:29,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /0.0.0.2:43938
      16:48:29,977 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.3:54630
      16:48:29,981 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /0.0.0.3:54630
      :
      # After 27 Minutes: Saw first log from QuorumCnxManager and zk host was figuring out if needs leaer-relection
      16:48:30,475 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (3, 2)
      16:48:30,475 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - Notification: 1 (message format version), 2 (n.leader), 0x6d00328f73 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x6d (n.peerEpoch) LOOKING (my state)
      16:48:30,476 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing mntr command from /2.2.2.2:55136
      16:48:30,476 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - Notification: 1 (message format version), 4 (n.leader), 0x6c00b36a5a (n.zxid), 0x35e (n.round), LOOKING (n.state), 1 (n.sid), 0x6c (n.peerEpoch) LOOKING (my state)
      16:48:30,476 - INFO [prod1-zk2.messaging.gq1.yahoo.com/2.2.2.2:2183:QuorumCnxManager$Listener@743] - Received connection request /3.3.3.3:34092
      16:48:30,476 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (4, 2)
      :
      # After manually restarting follower host, finally follower hosts 
      16:50:53,986 - INFO [main:QuorumPeerConfig@398] - Defaulting to majority quorums
      16:50:53,991 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
      16:50:53,991 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
      16:50:53,992 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
      16:50:54,003 - INFO [main:QuorumPeerMain@130] - Starting quorum peer
      16:50:54,026 - INFO [main:FileSnap@86] - Reading snapshot
      :
      16:55:04,377 - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@65] - FOLLOWING - LEADER ELECTION TOOK - 233617
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            rdhabalia Rajan Dhabalia
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: