[ZOOKEEPER-4781] ZooKeeper not starting because the accepted epoch is less than the current epoch. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.10, 3.4.14, 3.6.4, 3.7.2, 3.8.3, 3.9.1
Fix Version/s: None
Component/s: server
Labels:
None

Description

This issue occurred in our abnormal testing environment, where the disk was injected with anomalies and frequently filled up.

The the scenario is as follows:

Configure three node ZooKeeper cluster, lets say nodes are A, B and C.
Start the cluster, and node C becomes the leader.
The disk of Node C is injected with a full disk exception.
Node C called the org.apache.zookeeper.server.quorum.Leader#lead method.

void lead() throws IOException, InterruptedException {
    self.end_fle = Time.currentElapsedTime();
    long electionTimeTaken = self.end_fle - self.start_fle;
    self.setElectionTimeTaken(electionTimeTaken);
    LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken);
    self.start_fle = 0;
    self.end_fle = 0;

    zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean);

    try {
        self.tick.set(0);
        zk.loadData();
        
        leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());

        // Start thread that waits for connection requests from 
        // new followers.
        cnxAcceptor = new LearnerCnxAcceptor();
        cnxAcceptor.start();
        
        readyToStart = true;
        long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
        
        zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
        
        synchronized(this){
            lastProposed = zk.getZxid();
        }
        
        newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
                null, null);


        if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
            LOG.info("NEWLEADER proposal has Zxid of "
                    + Long.toHexString(newLeaderProposal.packet.getZxid()));
        }
        
        waitForEpochAck(self.getId(), leaderStateSummary);
        self.setCurrentEpoch(epoch);
...

Node C, as the leader, will start the LearnerCnxAcceptor thread, and then call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At this time, the value of waitingForNewEpoch is true, and the size of connectingFollowers is not greater than n/2. Node C directly calls connectingFollowers.wait to wait. The maximum waiting time is self.getInitLimit()*self.getTickTime() ms.

public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
    synchronized(connectingFollowers) {
        if (!waitingForNewEpoch) {
            return epoch;
        }
        if (lastAcceptedEpoch >= epoch) {
            epoch = lastAcceptedEpoch+1;
        }
        if (isParticipant(sid)) {
            connectingFollowers.add(sid);
        }
        QuorumVerifier verifier = self.getQuorumVerifier();
        if (connectingFollowers.contains(self.getId()) && 
                                        verifier.containsQuorum(connectingFollowers)) {
            self.setAcceptedEpoch(epoch);
            waitingForNewEpoch = false;
            connectingFollowers.notifyAll();
        } else {
            long start = Time.currentElapsedTime();
            long cur = start;
            long end = start + self.getInitLimit()*self.getTickTime();
            while(waitingForNewEpoch && cur < end) {
                connectingFollowers.wait(end - cur);
                cur = Time.currentElapsedTime();
            }
            if (waitingForNewEpoch) {
                throw new InterruptedException("Timeout while waiting for epoch from quorum");        
            }
        }
        return epoch;
    }
}

Node B connects to the 2888 communication port of node C and starts a new LeanerHandler thread.
Node A connects to the 2888 communication port of node C and starts a new LeanerHandler thread.
After node B connects to node C, call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread.At this point, the value of waitingForNewEpoch is true, and the size of connectingFollowers is greater than n/2. Then, set the value of waitingForNewEpoch to false. Due to the disk of node C being full, calling setAcceptedEpoch to write the acceptedEpoch value failed with an IO exception. Node C fails to update the acceptedEpoch file and did not successfully call the connectingFollowers.notifyAll() method. This will cause node C to wait at connectingFollowers.wait, with a maximum wait of self.getInitLimit()*self.getTickTime() ms. Due to an IO exception thrown, the socket connection between node B and node C was disconnected, and the LeanerHandler thread exited.

2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still openjava.io.FileNotFoundException: /cloud/data/zookeeper/data/version-2/acceptedEpoch.tmp (No space left on device) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.<init>(FileOutputStream.java:213) at java.io.FileOutputStream.<init>(FileOutputStream.java:162) at org.apache.zookeeper.common.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:59) at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1393) at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:1437) at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:885) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358)

After node A connects to node C, the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method is called in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly.
Node B reconnects to the 2888 cluster communication port of node C and starts a new LeanerHandler thread. Call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly.
After node A connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait.
After node B connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait.

public void waitForEpochAck(long id, StateSummary ss) throws IOException, InterruptedException {
    synchronized(electingFollowers) {
        if (electionFinished) {
            return;
        }
        if (ss.getCurrentEpoch() != -1) {
            if (ss.isMoreRecentThan(leaderStateSummary)) {
                throw new IOException("Follower is ahead of the leader, leader summary: " 
                                                + leaderStateSummary.getCurrentEpoch()
                                                + " (current epoch), "
                                                + leaderStateSummary.getLastZxid()
                                                + " (last zxid)");
            }
            if (isParticipant(id)) {
                electingFollowers.add(id);
            }
        }
        QuorumVerifier verifier = self.getQuorumVerifier();
        if (electingFollowers.contains(self.getId()) && verifier.containsQuorum(electingFollowers)) {
            electionFinished = true;
            electingFollowers.notifyAll();
        } else {                
            long start = Time.currentElapsedTime();
            long cur = start;
            long end = start + self.getInitLimit()*self.getTickTime();
            while(!electionFinished && cur < end) {
                electingFollowers.wait(end - cur);
                cur = Time.currentElapsedTime();
            }
            if (!electionFinished) {
                throw new InterruptedException("Timeout while waiting for epoch to be acked by quorum");
            }
        }
    }
}

Node C waits for self.getInitLimit()*self.getTickTime() ms in the connectingFollowers.wait method. However, since the value of waitingForNewEpoch is false, it does not throw a timeout exception and returns the epoch directly.
Node C calls the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this time, selectingFollowers contains node C and the size of selectingFollowers is greater than n/2. Then call the selectingFollowers.notifyAll() method to release the lock.
The disk space of node C has been partially released, so subsequent calls to setCurrentEpoch have successfully updated the currentEpoch file.
The cluster provides services normally for a period of time.
Stop C.
Start C, bellow exception with message "The accepted epoch, c is less than the current epoch, d" is thrown.

2023-12-15 15:57:00,051 [myid:3] - ERROR [main:QuorumPeer@698] - Unable to load database on disk
java.io.IOException: The accepted epoch, c is less than the current epoch, d
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2023-12-15 15:57:00,132 [myid:3] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: The accepted epoch, c is less than the current epoch, d
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

0001-zk-leader-acceptedEpoch-zk-zk.patch
21/Dec/23 11:55
3 kB
zhanglu153

Issue Links

is related to

ZOOKEEPER-1291 AcceptedEpoch not updated at leader before it proposes the epoch to followers

Closed

ZooKeeper not starting because the accepted epoch is less than the current epoch.

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates