Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4203

Leader swallows the ZooKeeperServer.State.ERROR from Leader.LearnerCnxAcceptor in some concurrency condition

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.6.2
    • Fix Version/s: None
    • Component/s: server

      Description

      We found a bug similar to ZOOKEEPER-2029. Leader.LearnerCnxAcceptor is a ZooKeeperCriticalThread and its exception is handled by ZooKeeperCriticalThread#handleException, which is supposed to shutdown the process or rejoin the quorum. However, in some concurrency condition, this correct handling does not occur, and thus this leader without LearnerCnxAcceptor will not accept any follower, which is the same symptom as ZOOKEEPER-2029. Ater reproduction, we confirmed that this bug exists in both the 3.6.2 release version and the master branch.

      Reproduction

      This concurrency condition can be constructed as follows: start a ZooKeeper cluster of 3 nodes, and the ServerSocket#accept invocation throws an exception in the LearnerCnxAcceptorHandler for the second follower that tries to join the quorum. If these 3 nodes get started almost simultaneously, e.g., within 2 seconds, then The aforementioned symptom may occur. In the log, we can observe that one of the followers keeps trying to join the quorum and keeps failing, and the other 2 servers always get the request from this follower but never let it join the quorum.

      We prepared the reproduction scripts in a gist link (https://gist.github.com/functioner/d5aad57669f4e4b92b4a8757fa2d0720), and the throwing exception of ServerSocket#accept is done by the Byteman injection script `serverSocketAccept-exception.btm`.

      Diagnosis

      The root cause of this bug is that when the Leader.LearnerCnxAcceptor thread handles an exception in ZooKeeperCriticalThread#handleException, the exception is not really "handled", meaning that it basically does nothing but set the ZooKeeperServerState as ERROR in the following stack trace: ZooKeeperCriticalThread#handleException -> ZooKeeperServerListenerImpl#notifyStopping -> QuorumZooKeeperServer#setState. The QuorumZooKeeperServer#setState method basically setting the state as ZooKeeperServer.State.ERROR in this scenario.

      Under normal Circumstances, this ERROR state will be detected by another thread in the following stack trace: QuorumPeer#run -> Leader#lead -> Leader#isRunning -> ZooKeeperServer#isRunning. In the code, it is within the infinite while loop at the end of Leader#lead (https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L758).

              void lead() throws IOException, InterruptedException {
                          // ...
                          startZkServer();                       // line 685
                          // ...
                          String shutdownMessage = null;
                          while (true) {
                              // ...
                              synchronized (this) {
                                  // ...
                                  if (!this.isRunning()) {       // line 758
                                      shutdownMessage = "Unexpected internal error";
                                      break;
                                  }
                                  // ...
                              }
                              // ...
                          }
                          if (shutdownMessage != null) {
                              shutdown(shutdownMessage);         // line 779
                          }
                          // ...
              }
      

      If the ERROR state from Leader.LearnerCnxAcceptor or any other ZooKeeperCriticalThread could be detected in this way, then the leader would be aware of the ERROR state and turn to https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L779, which is correct.

      However, in some concurrency condition, the ERROR state can't be detected in this way, because in Leader#lead, https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685 can cover this ERROR state with a RUNNING state in the following stack trace: Leader#lead -> Leader#startZkServer -> LeaderZooKeeperServer#startup -> ZooKeeperServer#startup -> QuorumZooKeeperServer#setState. Therefore, if the ERROR state occurs before the invocation of Leader#startZkServer in Leader#lead, then this ERROR state will be covered, because QuorumZooKeeperServer#setState does not record or handle the old state.

      The reproduction script we provided can construct this concurrency condition, because, usually, after the server startup, it takes some time for the QuorumPeer thread to reach https://github.com/apache/zookeeper/blob/release-3.6.2/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L685. Thus the exception LearnerCnxAcceptorHandler, if any, usually occurs earlier than that. Then the aforementioned symptom happens.

      Fix

      In terms of the fix, we basically add a sanity check to prevent the RUNNING/INITIAL state from covering the possible ERROR state.

      Notice that this fix also influences the follower, which also uses the QuorumZooKeeperServer#setState method. Given that the follower may be affected by the similar symptom (RUNNING state covers ERROR state in the server startup), this fix solves this potential issue, too.

      The fix can be confirmed by using same reproduction script and then observe the behavior of the servers. The leader should be able to recover from the fault and all followers are able to join the quorum.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                functioner Haoze Wu
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h
                  4h