We found a bug similar to
ZOOKEEPER-2029. Leader.LearnerCnxAcceptor is a ZooKeeperCriticalThread and its exception is handled by ZooKeeperCriticalThread#handleException, which is supposed to shutdown the process or rejoin the quorum. However, in some concurrency condition, this correct handling does not occur, and thus this leader without LearnerCnxAcceptor will not accept any follower, which is the same symptom as ZOOKEEPER-2029. Ater reproduction, we confirmed that this bug exists in both the 3.6.2 release version and the master branch.
This concurrency condition can be constructed as follows: start a ZooKeeper cluster of 3 nodes, and the ServerSocket#accept invocation throws an exception in the LearnerCnxAcceptorHandler for the second follower that tries to join the quorum. If these 3 nodes get started almost simultaneously, e.g., within 2 seconds, then The aforementioned symptom may occur. In the log, we can observe that one of the followers keeps trying to join the quorum and keeps failing, and the other 2 servers always get the request from this follower but never let it join the quorum.
We prepared the reproduction scripts in a gist link (https://gist.github.com/functioner/d5aad57669f4e4b92b4a8757fa2d0720), and the throwing exception of ServerSocket#accept is done by the Byteman injection script `serverSocketAccept-exception.btm`.
The root cause of this bug is that when the Leader.LearnerCnxAcceptor thread handles an exception in ZooKeeperCriticalThread#handleException, the exception is not really "handled", meaning that it basically does nothing but set the ZooKeeperServerState as ERROR in the following stack trace: ZooKeeperCriticalThread#handleException -> ZooKeeperServerListenerImpl#notifyStopping -> QuorumZooKeeperServer#setState. The QuorumZooKeeperServer#setState method basically setting the state as ZooKeeperServer.State.ERROR in this scenario.
Under normal Circumstances, this ERROR state will be detected by another thread in the following stack trace: QuorumPeer#run -> Leader#lead -> Leader#isRunning -> ZooKeeperServer#isRunning. In the code, it is within the infinite while loop at the end of Leader#lead (https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L758).
If the ERROR state from Leader.LearnerCnxAcceptor or any other ZooKeeperCriticalThread could be detected in this way, then the leader would be aware of the ERROR state and turn to https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L779, which is correct.
However, in some concurrency condition, the ERROR state can't be detected in this way, because in Leader#lead, https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685 can cover this ERROR state with a RUNNING state in the following stack trace: Leader#lead -> Leader#startZkServer -> LeaderZooKeeperServer#startup -> ZooKeeperServer#startup -> QuorumZooKeeperServer#setState. Therefore, if the ERROR state occurs before the invocation of Leader#startZkServer in Leader#lead, then this ERROR state will be covered, because QuorumZooKeeperServer#setState does not record or handle the old state.
The reproduction script we provided can construct this concurrency condition, because, usually, after the server startup, it takes some time for the QuorumPeer thread to reach https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685. Thus the exception LearnerCnxAcceptorHandler, if any, usually occurs earlier than that. Then the aforementioned symptom happens.
In terms of the fix, we basically add a sanity check to prevent the RUNNING/INITIAL state from covering the possible ERROR state.
Notice that this fix also influences the follower, which also uses the QuorumZooKeeperServer#setState method. Given that the follower may be affected by the similar symptom (RUNNING state covers ERROR state in the server startup), this fix solves this potential issue, too.
The fix can be confirmed by using same reproduction script and then observe the behavior of the servers. The leader should be able to recover from the fault and all followers are able to join the quorum.