  1. ZooKeeper
  2. ZOOKEEPER-3909

Zookeeper Unable to Join the Cluster after it is Restarted; Error: "This ZooKeeper instance is not currently serving requests"

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.5.7
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      All Environments 

      Description

      When we restart a ZooKeeper server, it does not rejoin the cluster or start serving clients. The ZooKeeper service starts successfully, but the server stays idle and returns the message: "This ZooKeeper instance is not currently serving requests"
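
A minimal sketch of how this state could be detected from a script, assuming the message is returned in reply to a four-letter-word command such as `srvr` on the client port and that the command is allowed by the server's `4lw.commands.whitelist` setting. The helper names, host, and port are hypothetical.

```python
import socket

# Message quoted in this report.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host: str, port: int, command: str = "srvr",
                     timeout: float = 5.0) -> str:
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def reports_not_serving(reply: str) -> bool:
    """Return True if the reply contains the 'not serving' message."""
    return NOT_SERVING in reply


# Example (hypothetical host and port):
# reply = four_letter_word("zk1.example.com", 2181)
# print(reports_not_serving(reply))
```

Keeping the check separate from the network call makes it easy to run against replies collected from each of the five members.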

      The ZooKeeper cluster has 5 members. Whenever we need to restart the servers, we restart them one at a time. We restart a server in one of two ways:

      1. Stop the service and start it back up again.
      2. Stop the service, replace the host, and start it back up again.

      In both cases we see the same issue.

      -----------

      When we investigated the ZooKeeper logs, we saw the following errors/warnings:

      "[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
      java.io.IOException: Leaders epoch, xx is less than accepted epoch, xy"
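
The exception text suggests that the restarting server compares the epoch announced by the leader with the accepted epoch it has stored locally, and gives up following when the leader's value is lower. A simplified sketch of that comparison follows; it is an illustration written for this report, not the actual `Learner` code, and the function name is hypothetical.

```python
def check_leader_epoch(leader_epoch: int, accepted_epoch: int) -> int:
    """Return the epoch the follower would accept, or raise if the leader is behind.

    Simplified illustration only; not the ZooKeeper implementation.
    """
    if leader_epoch < accepted_epoch:
        # Mirrors the wording of the message seen in the log above.
        raise IOError(
            f"Leaders epoch, {leader_epoch} is less than accepted epoch, {accepted_epoch}"
        )
    # Equal or newer: the follower can proceed with the leader's epoch.
    return leader_epoch
```

Under this reading, the warning can only appear when the restarting server's stored accepted epoch is higher than the epoch the leader sends.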

      -------------------------

      However, when we check, the current epoch of the leader is always the same as its accepted epoch, and both also match the values on the ZooKeeper server we are trying to bring back into the quorum.
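
A minimal sketch of how the stored epochs could be read on each member for comparison. It assumes the values live in text files named `currentEpoch` and `acceptedEpoch` inside the `version-2` directory, each holding one decimal number; the function name and the example path are hypothetical.

```python
from pathlib import Path


def read_epochs(data_dir):
    """Read currentEpoch and acceptedEpoch from <data_dir>/version-2.

    Assumes each file holds a single decimal number as text.
    A missing file is reported as None.
    """
    version_dir = Path(data_dir) / "version-2"
    epochs = {}
    for name in ("currentEpoch", "acceptedEpoch"):
        path = version_dir / name
        epochs[name] = int(path.read_text().strip()) if path.exists() else None
    return epochs


# Example (hypothetical path):
# print(read_epochs("/var/lib/zookeeper"))
```

Running this on the leader and on the restarting server before it is started would show whether the stored values really agree at that moment.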

      ------------------------

      Also, when we get the zxid of every quorum member, they all have the same first byte and only the last two digits differ, so we assume they are in sync.
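
A zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter, so the comparison can be made explicit by splitting it. A small sketch, with a hypothetical function name and a made-up example value:

```python
def split_zxid(zxid):
    """Split a zxid into (epoch, counter).

    Accepts an int or a hex string such as "0x500000012".
    """
    value = int(zxid, 16) if isinstance(zxid, str) else int(zxid)
    epoch = value >> 32            # high 32 bits
    counter = value & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


# Example with a made-up zxid:
# split_zxid("0x500000012") returns (5, 18)
```

Members whose zxids share the same epoch part and differ only in the counter are in the same epoch, which is what the observation above amounts to.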

      Somehow the ZooKeeper server that we are restarting sees the epoch as having advanced and shuts down as a follower.

      --------------

      The workaround we currently use for this issue is:

      stop the ZooKeeper service --> rename the current ZooKeeper data directory (version-2) --> start it back up again.

      The server then immediately joins the cluster as a follower, since it no longer has any record of the epoch, and starts serving clients.
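
The rename step of this workaround can be sketched as follows. The sketch assumes the service has already been stopped by other means, that snapshots and transaction logs share one `version-2` directory, and that the `myid` file sits in the parent data directory and is left untouched. The function name, backup naming scheme, and example path are hypothetical.

```python
import time
from pathlib import Path


def set_aside_version_dir(data_dir, suffix=None):
    """Rename <data_dir>/version-2 to version-2.bak-<suffix> and return the new path.

    Only performs the rename; stopping and starting the service is out of scope.
    """
    src = Path(data_dir) / "version-2"
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"version-2.bak-{suffix}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)  # keeps the old data for later inspection
    return dst


# Example (hypothetical path), to be run only while the service is stopped:
# set_aside_version_dir("/var/lib/zookeeper")
```

Renaming rather than deleting keeps the old epoch files and logs available for comparing against the leader afterwards.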

      ----------

            People

            • Assignee:
              Unassigned
            • Reporter:
              Bhoi Saswati
            • Votes:
              1
            • Watchers:
              3
