ZooKeeper
ZOOKEEPER-1869

ZK server falls out of the quorum due to connection loss and cannot rejoin

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: quorum
    • Labels: None
    • Environment: Using CentOS6 for running these zookeeper servers

      Description

      We have deployed ZooKeeper version 3.5.0.1515976, with 3 ZK servers in the quorum.
      The problem we are facing is that one ZooKeeper server drops out of the quorum and never rejoins the cluster until we restart the ZooKeeper server on that node.

      Our interpretation of the ZooKeeper logs on all nodes is as follows:
      (For simplicity, assume S1 => ZK server 1, S2 => ZK server 2, S3 => ZK server 3.)
      Initially S3 is the leader while S1 and S2 are followers.

      S2 hits 46 seconds of latency while fsyncing its write-ahead log, which results in loss of connection with S3.
      S3 in turn prints the following error message:

      Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
      Stack trace

                  • GOODBYE /169.254.1.2:47647(S2) ********

      S2 in this case closes its connection with S3 (the leader) and shuts down its follower with the following log messages:
      Closing connection to leader, exception during packet send
      java.net.SocketException: Socket close
      Follower@194] - shutdown called
      java.lang.Exception: shutdown Follower

      After this point, S3 can never reestablish a connection with S2, and the leader election mechanism keeps failing. S3 now keeps printing the following message repeatedly:
      Cannot open channel to 2 at election address /169.254.1.2:3888
      java.net.ConnectException: Connection refused.

      While S3 is in this state, S2 repeatedly prints the following messages:
      INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
      Exception causing close of session 0x0: ZooKeeperServer not running
      Closed socket connection for client /127.0.0.1:60667 (no session established for client)

      Leader election never completes successfully, leaving S2 out of the quorum.
      S2 was out of the quorum for almost a week.

      While debugging this issue, we found that neither the election port nor the peer connection port on S2 could be reached via telnet from any of the nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted the ZK server on S2 (service zookeeper-server restart); after that we could telnet to both ports, and S2 joined the ensemble after a leader election attempt.
      Any idea what might be forcing S2 into a situation where it won't accept any connections on the leader election and peer connection ports?
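
      As a quick way to repeat the check described above, a plain TCP probe of the peer (2888) and election (3888) ports can confirm from another node whether the process is accepting connections at all. This is a generic socket sketch, not part of ZooKeeper; the address 169.254.1.2 is taken from the logs above and the 3-second timeout is an arbitrary choice.

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import java.net.Socket;

      public class PortProbe {
          // Try a TCP connect with a short timeout and report whether the port answered.
          static void probe(String host, int port) {
              try (Socket s = new Socket()) {
                  s.connect(new InetSocketAddress(host, port), 3000);
                  System.out.println(host + ":" + port + " accepted the connection");
              } catch (IOException e) {
                  System.out.println(host + ":" + port + " unreachable: " + e.getMessage());
              }
          }

          public static void main(String[] args) {
              // 169.254.1.2 is S2 in the logs above; 2888/3888 are the peer and election ports.
              probe("169.254.1.2", 2888);
              probe("169.254.1.2", 3888);
          }
      }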

        Activity

        Deepak Jagtap added a comment -

        Logs for all the servers have been uploaded to following dropbox link:
        https://dl.dropboxusercontent.com/u/36429721/zklog.tgz

        Germán Blanco added a comment -

        In the log of server 2, something happens here:

        2014-01-05 20:04:37,651 [myid:2] - ERROR [/169.254.1.2:3888:QuorumCnxManager$Listener@567] - As I'm leaving the listener thread, I won't be able to participate in leader election any longer: /169.254.1.2:3888
        2014-01-05 20:04:37,675 [myid:2] - WARN  [WorkerSender[myid=2]:QuorumCnxManager@390] - Cannot open channel to 1 at election address /169.254.1.1:3888
        java.net.NoRouteToHostException: No route to host
        

        The first log message seems to indicate that server 2 won't be listening for FLE connections any more. Does anybody know if there is an automatic way to recover from this?
        The "No route to host" message occurs a lot. Was there any problem with the network hardware?

        Deepak Jagtap added a comment -

        The servers are virtual machines, hence they use a vswitch for network connectivity.
        The hardware looks fine, since other network utilities such as scp and ssh work fine.
        But we could not telnet to the ZooKeeper peer and leader election ports, 2888 and 3888.
        The quorum works fine after restarting server 2, so we assume the network is not the problem.

        Please note that the problem started at "2014-01-07 10:34:01,114" on node 2.

        Deepak Jagtap added a comment -

        Hi German,

        Did you get a chance to look at the logs? Any idea whether this is a ZooKeeper bug or
        some other configuration issue?

        Thanks & Regards,
        Deepak

        Germán Blanco added a comment -

        Hello Deepak,

        Yes, I looked at the logs.
        It could be a ZooKeeper bug, triggered by network instability. As indicated in my previous post (the second comment in this JIRA), the code reaches a point that I haven't seen before in other logs, and the log indicates that the server will no longer be listening on the FLE port. If we can verify that your problem happens when that log line is produced, then we have located the bug. Maybe you could try to reproduce the situation in which the problem happened, or we could generate a ZooKeeper jar that jumps to that part of the code under some condition. If you then reach the same state as in your initial problem, we know what we have to fix. Do you think you could please do that?
        As said, the root cause of the problem seems to be network instability. Looking at the logs, I can see many indications that something is not going well with the connections; I can't tell whether that is something you can solve in your environment or not. But that doesn't change the fact that there might be something to correct in ZooKeeper.

        Regards,

        German.

        Deepak Jagtap added a comment -

        Hi German,

        Unfortunately this issue occurred at a customer site and is hence difficult to reproduce.
        If we manage to reproduce it in the lab, we will update you with a new set of logs.

        Thanks & Regards,
        Deepak

        Deepak Jagtap added a comment -

        Hi German,

        I observed another instance where the ZooKeeper cluster falls apart.
        But in this case leader election kept happening in a loop every couple of minutes,
        while in the previous case leader election never completed for almost 3-4 days.
        I am using a slightly older revision of ZooKeeper 3.5.0 (released some time in March);
        any idea if there are any fixes related to these bugs?

        A more detailed description of this issue:
        I have a 3-node ZooKeeper 3.5.0.1458648 quorum on my setup.
        We came across a situation where one of the ZK servers in the cluster went down due to a bad disk.
        We observed that leader election keeps running in a loop (starts, completes, and starts again). The loop repeats every couple of minutes. Even restarting the ZooKeeper server on both nodes doesn't help recover from this loop.
        The network connection looks fine though, as I can telnet to the leader election port and ssh from one node to the other.
        The ZooKeeper client on each node uses "127.0.0.1:2181" as the quorum string for connecting to the server, so if the local ZooKeeper server is down, the client app is dead.

        I have uploaded zookeeper.log for both nodes at the following link:
        https://dl.dropboxusercontent.com/u/36429721/zkSupportLog.tar.gz

        Any idea what might be wrong with the quorum? Please note that restarting the ZooKeeper server on both nodes doesn't help to recover from this situation.

        Thanks & Regards,
        Deepak
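
        On the last point about the "127.0.0.1:2181" connect string: passing the full ensemble lets the client library fail over to a surviving server instead of being pinned to the local one. Below is a minimal sketch using the standard ZooKeeper Java client; the zk1/zk2/zk3 hostnames and the 30-second session timeout are illustrative assumptions, not values from this setup.

        import java.io.IOException;
        import java.util.concurrent.CountDownLatch;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;

        public class EnsembleClient {
            public static void main(String[] args) throws IOException, InterruptedException {
                // Listing all three servers (hostnames are illustrative) lets the client
                // pick any reachable server, instead of depending on 127.0.0.1 only.
                String connectString = "zk1:2181,zk2:2181,zk3:2181";
                CountDownLatch connected = new CountDownLatch(1);

                ZooKeeper zk = new ZooKeeper(connectString, 30000, new Watcher() {
                    @Override
                    public void process(WatchedEvent event) {
                        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    }
                });

                connected.await();
                System.out.println("Connected, session id: 0x" + Long.toHexString(zk.getSessionId()));
                zk.close();
            }
        }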

        Camille Fournier added a comment -

        These are some crazy logs. Unfortunately for the second case I think you're too far behind master for us to do much with it.
        As far as I can tell, the two remaining servers attempt to form a quorum, but every time the follower gets UPTODATE, it fails to send the ACK packet so that the leader can read it (we see a broken pipe on the follower in writePacket, and a read timeout on the leader when trying to deserialize the type of the packet). Is it possible that you were running with 2 different generated protocol definitions?

        Deepak Jagtap added a comment -

        All servers run an identical ZooKeeper version, so most likely there is no mismatch in protocol definitions. Leader election goes through fine until the sync stage, so network connectivity is fine. Also verified that ssh and ping work fine among the nodes.


          People

          • Assignee: Unassigned
          • Reporter: Deepak Jagtap
          • Votes: 0
          • Watchers: 4
