Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4537

Race between SyncThread and CommitProcessor thread

    XMLWordPrintableJSON

Details

    Description

      Zookeeper server can get stuck when it has just one client and the only way it recovers is due to a socket timeout or another client commit request.

      Details:

      We have a system where a system manager app (SM) launches all the other applications based on some config criteria.  When system boots up, SM connects to zookeeper and is the only client talking to zookeeper until all other apps are launched. SM does a bunch of sync create() calls to zookeeper, before it starts up all the other apps. So at this point zookeeper server has got just one connection, which is from SM. 1 out of 3 times during system startup, SM gets stuck at a random create calls and there are only two ways this gets unwedged.
      1. The socket has to time out and we get ZOPERATIONTIMEOUT
      2. If we start another connection to zookeeper (I used zkCli.sh to do this), this unwedges the exiting connection to SM.
       
      From strace I can see that there is a race between SyncThread and CommitProcessor thread.
       
      Sync thread
      class CommitProcessor {
      run() {
                  int requestsToProcess = 0;
                  boolean commitIsWaiting = false;
                  do {
                      commitIsWaiting = !committedRequests.isEmpty(); <<<<< we are checking if there are more messages to process here
                      requestsToProcess = queuedRequests.size();
                      // Avoid sync if we have something to do
                      if (requestsToProcess == 0 && !commitIsWaiting) {
                          // Waiting for requests to process
                          synchronized (this) {
                              while (!stopped && requestsToProcess == 0 && !commitIsWaiting)

      { <<<<< and acting on the information read above inside the sync block                             wait(); <<<<<< wait here                             commitIsWaiting = !committedRequests.isEmpty();                             requestsToProcess = queuedRequests.size();                         }

                          }
                      }
       
       
      Commit thread
          public void commit(Request request) {
              if (stopped || request == null)

      {             return;         }

              LOG.debug("Committing request:: {}", request);
              request.commitRecvTime = Time.currentElapsedTime();
              ServerMetrics.getMetrics().COMMITS_QUEUED.add(1);
              committedRequests.add(request); <<<<<< enqueue messages here
              wakeup();
          }
       
          @SuppressFBWarnings("NN_NAKED_NOTIFY")
          private synchronized void wakeup()

      { <<<<< wakeup call is synchronized         notifyAll();     }

       
       
      Now lets consider the following scenario
        1. Sync thread reads commitIsWaiting and there are no commits pending.
        2. This thread gets scheduled out
        3. We got a commit request – CommitProcessor thread adds the request to committedRequests and calls wakeup
        4. CommitProcessor goes ahead and does a notifyAll().
        5. Since the sync thread has not reached the wait() yet, there is no one to wake up.
        6. Sync thread gets scheduled back in.
        7. It goes ahead and does a wait() but since there are no other connections or new commit requests no one does a wakeup(). So this thread waits here until the socket is timed out.
           7a. Or if another commit is made where we endup calling notifyAll() which wakes up the waiting thread.
       
      I modified the CommitProcessor::run() like this
                  do {
                      /*
                       * Since requests are placed in the queue before being sent to
                       * the leader, if commitIsWaiting = true, the commit belongs to
                       * the first update operation in the queuedRequests or to a
                       * request from a client on another server (i.e., the order of
                       * the following two lines is important!).
                       */
                          synchronized (this) { <<<<<
                                  commitIsWaiting = !committedRequests.isEmpty(); <<<<< moved the queue checks inside the sync block to ensure we don’t have the race condition
                                  requestsToProcess = queuedRequests.size();
                                  // Avoid sync if we have something to do
                                  if (requestsToProcess == 0 && !commitIsWaiting) {
                                          // Waiting for requests to process
                                          while (!stopped && requestsToProcess == 0 && !commitIsWaiting)

      {                                             wait();                                             commitIsWaiting = !committedRequests.isEmpty();                                             requestsToProcess = queuedRequests.size();                                     }

                                  }
                          }
       
      This seems to have fixed the issue. 

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jithingirish Jithin Girish
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m