HBASE-11488

cancelTasks in SubprocedurePool can hang during task error

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.96.1, 0.99.0, 0.98.3
    • Fix Version/s: 0.99.0, 0.98.4
    • Component/s: snapshots
    • Labels: None

    Description

      During a snapshot on the region server side, if one RegionSnapshotTask throws an exception, we cancel the other tasks.
      The relevant code is in RegionServerSnapshotManager.SnapshotSubprocedurePool.waitForOutstandingTasks():

            LOG.debug("Waiting for local region snapshots to finish.");
      
            int sz = futures.size();
            try {
              // Using the completion service to process the futures that finish first first.
              for (int i = 0; i < sz; i++) {
                Future<Void> f = taskPool.take();
                f.get();
                if (!futures.remove(f)) {
                  LOG.warn("unexpected future" + f);
                }
                LOG.debug("Completed " + (i+1) + "/" + sz +  " local region snapshots.");
              }
              LOG.debug("Completed " + sz +  " local region snapshots.");
              return true;
            } catch (InterruptedException e) {
              LOG.warn("Got InterruptedException in SnapshotSubprocedurePool", e);
              if (!stopped) {
                Thread.currentThread().interrupt();
                throw new ForeignException("SnapshotSubprocedurePool", e);
              }
              // we are stopped so we can just exit.
            } catch (ExecutionException e) {
              if (e.getCause() instanceof ForeignException) {
                LOG.warn("Rethrowing ForeignException from SnapshotSubprocedurePool", e);
                throw (ForeignException)e.getCause();
              }
              LOG.warn("Got Exception in SnapshotSubprocedurePool", e);
              throw new ForeignException(name, e.getCause());
            } finally {
              cancelTasks();
            }
      

      If f.get() throws an ExecutionException (for example, caused by a NotServingRegionException), the finally block calls cancelTasks().
      In cancelTasks():

           ...
           // evict remaining tasks and futures from taskPool.
           while (!futures.isEmpty()) {
              // block to remove cancelled futures;
              LOG.warn("Removing cancelled elements from taskPool");
              futures.remove(taskPool.take());
            }
      

      For example, suppose we have 3 tasks and the first one fails, so we get an exception when we do:

                Future<Void> f = taskPool.take();
                f.get();
      

      At that point 'f' has not been removed from the 'futures' list, but its completion has already been taken from taskPool.
      As a result, there are 3 entries in the 'futures' list, but only 2 completions remain in taskPool.
      Once those 2 are drained, 'futures' is still not empty, so the cancelTasks() loop above blocks on taskPool.take().
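
      The mismatch can be reproduced outside HBase with a plain ExecutorCompletionService. The sketch below is only an illustration: the class name CancelHangDemo, the method leftoverFutures(), and the latch used to force the failing task to complete first are hypothetical, the elided cancellation step is left out, and a timed poll() stands in for take() so that the demo terminates instead of hanging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of HBase.
public class CancelHangDemo {

  /** Returns how many futures are still tracked at the point where take() would block. */
  public static int leftoverFutures() {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);
    try {
      Callable<Void> failing = () -> { throw new IllegalStateException("simulated task failure"); };
      // The other two tasks wait on the latch so the failing task always completes first.
      Callable<Void> succeeding = () -> { release.await(); return null; };
      futures.add(taskPool.submit(failing));
      futures.add(taskPool.submit(succeeding));
      futures.add(taskPool.submit(succeeding));

      // Same shape as waitForOutstandingTasks(): take, then get, then remove.
      Future<Void> f = taskPool.take();
      try {
        f.get();
        futures.remove(f);
      } catch (ExecutionException e) {
        // f has been taken from taskPool but is still in 'futures'.
      }
      release.countDown(); // let the two remaining tasks complete

      // Same shape as the eviction loop, with a timed poll in place of take().
      while (!futures.isEmpty()) {
        Future<Void> done = taskPool.poll(500, TimeUnit.MILLISECONDS);
        if (done == null) {
          break; // a blocking take() would wait forever here
        }
        futures.remove(done);
      }
      return futures.size();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("futures left when the loop stalls: " + leftoverFutures());
  }
}
```

      Under these assumptions one future (the failed one) should remain in the list with nothing left in the completion queue, which is the state in which a blocking take() never returns.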

      The end result is that the procedure always fails with a timeout exception.
      We could have bailed out earlier with the real cause.
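
      One possible way to avoid blocking, shown here only as a sketch and not necessarily what the attached patches do, is to stop tying the drain loop to the size of the 'futures' list: cancel every tracked future, clear the list, and discard whatever completions are already queued with the non-blocking poll(). The class NonBlockingCancelSketch and its methods are hypothetical names, and the latch only exists to make the scenario deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch class; not part of HBase.
public class NonBlockingCancelSketch {

  /** Cancels everything still tracked and drains the completion queue without blocking. */
  static <T> void cancelTasks(List<Future<T>> futures, CompletionService<T> taskPool) {
    for (Future<T> f : futures) {
      f.cancel(false); // has no effect on tasks that already finished
    }
    futures.clear();
    while (taskPool.poll() != null) {
      // discard completions that are already queued; never waits for more
    }
  }

  /** Runs the 3-task scenario and returns how many futures remain tracked afterwards. */
  public static int remainingAfterCancel() {
    ExecutorService executor = Executors.newFixedThreadPool(3);
    ExecutorCompletionService<Void> taskPool = new ExecutorCompletionService<>(executor);
    List<Future<Void>> futures = new ArrayList<>();
    CountDownLatch release = new CountDownLatch(1);
    try {
      Callable<Void> failing = () -> { throw new IllegalStateException("simulated task failure"); };
      Callable<Void> succeeding = () -> { release.await(); return null; };
      futures.add(taskPool.submit(failing));
      futures.add(taskPool.submit(succeeding));
      futures.add(taskPool.submit(succeeding));

      Future<Void> f = taskPool.take(); // the failing task is the only one that can finish
      try {
        f.get();
        futures.remove(f);
      } catch (ExecutionException e) {
        // The real cause is available here; cancel without waiting on the pool.
        cancelTasks(futures, taskPool);
      }
      return futures.size();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("futures left after cancel: " + remainingAfterCancel());
  }
}
```

      With this shape the caller returns as soon as the first failure is observed, so the original exception can be reported instead of a later timeout.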

      Attachments

        1. HBASE-11488-0.98.patch
          0.9 kB
          Jerry He
        2. HBASE-11488-master.patch
          2 kB
          Jerry He

          People

            Assignee: jinghe Jerry He
            Reporter: jinghe Jerry He