Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-18372

Potential infinite busy loop in HMaster's ProcedureExecutor

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • 1.3.1
    • None
    • master
    • None
    • Kernel 3.10.0-327.10.1.el7.x86_64
      JVM 1.8.0_102

    Description

      While investigating an issue today with timoha we saw the HMaster consistently burning 1.5 cores of CPU cycles. Upon looking more closely, it was actually all 8 threads of ProcedureExecutor thread pool taking constantly ~15% of a CPU core each (I identified this by looking at individual threads in top and cross-referencing the thread IDs with the thread IDs in a JVM stack trace). The HMaster log or output didn't contain anything suspicious and it was hard for us to ascertain what exactly was happening. It just looked like these threads were regularly spinning, doing nothing. We just saw a lot of futex system calls happening all the time, and all the threads of the thread pool regularly taking turns in waking up and going back to sleep.

      My reading of the code in procedure2/ProcedureExecutor.java is that this can happen if the threads in the thread pool have been interrupted for some reason:

        private void execLoop() {
          while (isRunning()) {
            Procedure proc = runnables.poll();
            if (proc == null) continue;
      

      and then in

      {master/procedure/MasterProcedureScheduler.java}

      :

        @Override
        public Procedure poll() {
          return poll(-1);
        }
      
        @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP")
        Procedure poll(long waitNsec) {
          Procedure pollResult = null;
          schedLock.lock();
          try {
            if (queueSize == 0) {
              if (waitNsec < 0) {
                schedWaitCond.await();
      [...]
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            schedLock.unlock();
          }
          return pollResult;
        }
      

      so my theory is the threads in the thread pool have all been interrupted (maybe by a procedure that ran earlier and left its thread interrupted) and so we are perpetually looping in execLoop, which ends up calling schedWaitCond.await();, which ends up throwing an InterruptedException, which ends up resetting the interrupt status of the thread, and rinse and repeat.

      But again I wasn't able to get any cold hard evidence that this is what was happening. There was just no other evidence that could explain this behavior, and I wasn't able to guess what else could be causing this that was consistent with what we saw and what I understood from reading the code.

      Attachments

        Activity

          People

            Unassigned Unassigned
            tsuna Benoit Sigoure
            Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: