Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4781

ZooKeeper not starting because the accepted epoch is less than the current epoch.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.5.10, 3.4.14, 3.6.4, 3.7.2, 3.8.3, 3.9.1
    • None
    • server
    • None

    Description

      This issue occurred in our abnormal testing environment, where the disk was injected with anomalies and frequently filled up.

      The the scenario is as follows:

      • Configure three node ZooKeeper cluster, lets say nodes are A, B and C.
      • Start the cluster, and node C becomes the leader.
      • The disk of Node C is injected with a full disk exception.
      • Node C called the org.apache.zookeeper.server.quorum.Leader#lead method.
      void lead() throws IOException, InterruptedException {
          self.end_fle = Time.currentElapsedTime();
          long electionTimeTaken = self.end_fle - self.start_fle;
          self.setElectionTimeTaken(electionTimeTaken);
          LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken);
          self.start_fle = 0;
          self.end_fle = 0;
      
          zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean);
      
          try {
              self.tick.set(0);
              zk.loadData();
              
              leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());
      
              // Start thread that waits for connection requests from 
              // new followers.
              cnxAcceptor = new LearnerCnxAcceptor();
              cnxAcceptor.start();
              
              readyToStart = true;
              long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
              
              zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
              
              synchronized(this){
                  lastProposed = zk.getZxid();
              }
              
              newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
                      null, null);
      
      
              if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
                  LOG.info("NEWLEADER proposal has Zxid of "
                          + Long.toHexString(newLeaderProposal.packet.getZxid()));
              }
              
              waitForEpochAck(self.getId(), leaderStateSummary);
              self.setCurrentEpoch(epoch);
      ... 
      • Node C, as the leader, will start the LearnerCnxAcceptor thread, and then call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At this time, the value of waitingForNewEpoch is true, and the size of connectingFollowers is not greater than n/2. Node C directly calls connectingFollowers.wait to wait. The maximum waiting time is self.getInitLimit()*self.getTickTime() ms.
      public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
          synchronized(connectingFollowers) {
              if (!waitingForNewEpoch) {
                  return epoch;
              }
              if (lastAcceptedEpoch >= epoch) {
                  epoch = lastAcceptedEpoch+1;
              }
              if (isParticipant(sid)) {
                  connectingFollowers.add(sid);
              }
              QuorumVerifier verifier = self.getQuorumVerifier();
              if (connectingFollowers.contains(self.getId()) && 
                                              verifier.containsQuorum(connectingFollowers)) {
                  self.setAcceptedEpoch(epoch);
                  waitingForNewEpoch = false;
                  connectingFollowers.notifyAll();
              } else {
                  long start = Time.currentElapsedTime();
                  long cur = start;
                  long end = start + self.getInitLimit()*self.getTickTime();
                  while(waitingForNewEpoch && cur < end) {
                      connectingFollowers.wait(end - cur);
                      cur = Time.currentElapsedTime();
                  }
                  if (waitingForNewEpoch) {
                      throw new InterruptedException("Timeout while waiting for epoch from quorum");        
                  }
              }
              return epoch;
          }
      }
      • Node B connects to the 2888 communication port of node C and starts a new LeanerHandler thread.
      • Node A connects to the 2888 communication port of node C and starts a new LeanerHandler thread.
      • After node B connects to node C, call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread.At this point, the value of waitingForNewEpoch is true, and the size of connectingFollowers is greater than n/2. Then, set the value of waitingForNewEpoch to false. Due to the disk of node C being full, calling setAcceptedEpoch to write the acceptedEpoch value failed with an IO exception. Node C fails to update the acceptedEpoch file and did not successfully call the connectingFollowers.notifyAll() method. This will cause node C to wait at connectingFollowers.wait, with a maximum wait of self.getInitLimit()*self.getTickTime() ms. Due to an IO exception thrown, the socket connection between node B and node C was disconnected, and the LeanerHandler thread exited.
      2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still openjava.io.FileNotFoundException: /cloud/data/zookeeper/data/version-2/acceptedEpoch.tmp (No space left on device) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.<init>(FileOutputStream.java:213) at java.io.FileOutputStream.<init>(FileOutputStream.java:162) at org.apache.zookeeper.common.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:59) at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1393) at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:1437) at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:885) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358)  
      • After node A connects to node C, the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method is called in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly.
      • Node B reconnects to the 2888 cluster communication port of node C and starts a new LeanerHandler thread. Call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly.
      • After node A connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait.
      • After node B connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait.
      public void waitForEpochAck(long id, StateSummary ss) throws IOException, InterruptedException {
          synchronized(electingFollowers) {
              if (electionFinished) {
                  return;
              }
              if (ss.getCurrentEpoch() != -1) {
                  if (ss.isMoreRecentThan(leaderStateSummary)) {
                      throw new IOException("Follower is ahead of the leader, leader summary: " 
                                                      + leaderStateSummary.getCurrentEpoch()
                                                      + " (current epoch), "
                                                      + leaderStateSummary.getLastZxid()
                                                      + " (last zxid)");
                  }
                  if (isParticipant(id)) {
                      electingFollowers.add(id);
                  }
              }
              QuorumVerifier verifier = self.getQuorumVerifier();
              if (electingFollowers.contains(self.getId()) && verifier.containsQuorum(electingFollowers)) {
                  electionFinished = true;
                  electingFollowers.notifyAll();
              } else {                
                  long start = Time.currentElapsedTime();
                  long cur = start;
                  long end = start + self.getInitLimit()*self.getTickTime();
                  while(!electionFinished && cur < end) {
                      electingFollowers.wait(end - cur);
                      cur = Time.currentElapsedTime();
                  }
                  if (!electionFinished) {
                      throw new InterruptedException("Timeout while waiting for epoch to be acked by quorum");
                  }
              }
          }
      } 
      • Node C waits for self.getInitLimit()*self.getTickTime() ms in the connectingFollowers.wait method. However, since the value of waitingForNewEpoch is false, it does not throw a timeout exception and returns the epoch directly.
      • Node C calls the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this time, selectingFollowers contains node C and the size of selectingFollowers is greater than n/2. Then call the selectingFollowers.notifyAll() method to release the lock.
      • The disk space of node C has been partially released, so subsequent calls to setCurrentEpoch have successfully updated the currentEpoch file.
      • The cluster provides services normally for a period of time.
      • Stop C.
      • Start C, bellow exception with message "The accepted epoch, c is less than the current epoch, d" is thrown.
      2023-12-15 15:57:00,051 [myid:3] - ERROR [main:QuorumPeer@698] - Unable to load database on disk
      java.io.IOException: The accepted epoch, c is less than the current epoch, d
          at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695)
          at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
      2023-12-15 15:57:00,132 [myid:3] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally
      java.lang.RuntimeException: Unable to run quorum server 
          at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
          at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
          at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
      Caused by: java.io.IOException: The accepted epoch, c is less than the current epoch, d
          at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695) 

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              zhanglu153 zhanglu153
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: