HBASE-13083: Master can be dead-locked while assigning META.

Details

    • Reviewed

    Description

      We hit a situation in which the master deadlocked.
      The deadlock appears to be in the master code. ServerShutdownHandler (SSH) calls RegionStates#serverOffline, which in turn
      acquires synchronized(this), effectively blocking all other requests to RegionStates.
      Another thread is processing assignMeta, which tries to access the region states and blocks.
      Finally, all assignment operations try to access meta for table states and region operations, but
      cannot do so because RegionStates is locked.
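
The lock pattern described above can be reproduced in isolation. The following is a minimal sketch, not HBase code: the class MetaDeadlockSketch and all of its methods are hypothetical stand-ins, a CountDownLatch plays the role of "meta is available", and a timeout replaces the unbounded wait so that the demonstration terminates. It assumes the fix direction is to avoid blocking on meta while the monitor is held; the attached patches may take a different approach.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for RegionStates; illustrates the lock pattern only. */
public class MetaDeadlockSketch {
    // Counted down once "meta" has been assigned.
    private final CountDownLatch metaAvailable = new CountDownLatch(1);
    // Counted down once the server-offline path has taken the monitor.
    private final CountDownLatch offlineStarted = new CountDownLatch(1);

    /** Reported shape: waits for meta while still holding the monitor. */
    public synchronized boolean serverOfflineHoldingLock(long timeoutMs)
            throws InterruptedException {
        offlineStarted.countDown();
        return metaAvailable.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Alternative shape: touch in-memory state under the lock, wait outside it. */
    public boolean serverOfflineReleasingLock(long timeoutMs)
            throws InterruptedException {
        synchronized (this) {
            offlineStarted.countDown(); // in-memory bookkeeping would go here
        }
        return metaAvailable.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Needs the monitor (like getRegionState) before meta can become available. */
    public void assignMeta() throws InterruptedException {
        offlineStarted.await();
        synchronized (this) {
            // read region state under the monitor
        }
        metaAvailable.countDown();
    }

    /** Runs both threads and reports whether the offline path ever saw meta. */
    public static boolean run(boolean holdLock) throws InterruptedException {
        MetaDeadlockSketch states = new MetaDeadlockSketch();
        boolean[] sawMeta = new boolean[1];
        Thread ssh = new Thread(() -> {
            try {
                sawMeta[0] = holdLock
                        ? states.serverOfflineHoldingLock(500)
                        : states.serverOfflineReleasingLock(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread assigner = new Thread(() -> {
            try {
                states.assignMeta();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        ssh.start();
        assigner.start();
        ssh.join();
        assigner.join();
        return sawMeta[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("holding lock:   sawMeta=" + run(true));
        System.out.println("releasing lock: sawMeta=" + run(false));
    }
}
```

With the monitor held, the wait can only end by timing out, because assignMeta cannot enter the synchronized block. Without the timeout, that is the hang shown in the thread dumps below. With the monitor released before the wait, assignMeta proceeds and the wait should complete normally.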

      serverOffline() waiting for meta availability

      Thread 17019: (state = BLOCKED)
       - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
       - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
       - java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferStack$SNode, boolean, long) @bci=158, line=458 (Compiled frame)
      /serverOffline
       - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
       - org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher, int, long) @bci=74, line=605 (Interpreted frame)
       - org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher, long) @bci=4, line=580 (Interpreted frame)
       - org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher, long, org.apache.hadoop.conf.Configuration) @bci=65, line=559 (Interpreted frame)
       - org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation() @bci=69, line=58 (Interpreted frame)
       - org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(org.apache.hadoop.hbase.TableName, boolean, int) @bci=83, line=1131 (Compiled frame)
       - org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(org.apache.hadoop.hbase.TableName, byte[], boolean, boolean, int) @bci=74, line=1098 (Compiled frame)
       - org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.findAllLocationsOrFail(org.apache.hadoop.hbase.client.Action, boolean) @bci=73, line=940 (Compiled frame)
       - org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(java.util.List, int) @bci=48, line=857 (Compiled frame)
       - org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$100(org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl, java.util.List, int) @bci=3, line=575 (Compiled frame)
       - org.apache.hadoop.hbase.client.AsyncProcess.submitAll(java.util.concurrent.ExecutorService, org.apache.hadoop.hbase.TableName, java.util.List, org.apache.hadoop.hbase.client.coprocessor.Batch$Callback, java.lang.Object[]) @bci=195, line=557 (Compiled frame)
       - org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.processBatchCallback(java.util.List, org.apache.hadoop.hbase.TableName, java.util.concurrent.ExecutorService, java.lang.Object[], org.apache.hadoop.hbase.client.coprocessor.Batch$Callback) @bci=11, line=2136 (Compiled frame)
       - org.apache.hadoop.hbase.util.MultiHConnection.processBatchCallback(java.util.List, org.apache.hadoop.hbase.TableName, java.lang.Object[], org.apache.hadoop.hbase.client.coprocessor.Batch$Callback) @bci=24, line=125 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStateStore.updateRegionState(long, org.apache.hadoop.hbase.master.RegionState, org.apache.hadoop.hbase.master.RegionState) @bci=421, line=244 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStates.updateRegionState(org.apache.hadoop.hbase.HRegionInfo, org.apache.hadoop.hbase.master.RegionState$State, org.apache.hadoop.hbase.ServerName, long) @bci=149, line=1109 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStates.updateRegionState(org.apache.hadoop.hbase.HRegionInfo, org.apache.hadoop.hbase.master.RegionState$State, org.apache.hadoop.hbase.ServerName) @bci=7, line=425 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStates.updateRegionState(org.apache.hadoop.hbase.HRegionInfo, org.apache.hadoop.hbase.master.RegionState$State) @bci=24, line=383 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStates.regionOffline(org.apache.hadoop.hbase.HRegionInfo, org.apache.hadoop.hbase.master.RegionState$State) @bci=83, line=586 (Interpreted frame)
       - org.apache.hadoop.hbase.master.RegionStates.regionOffline(org.apache.hadoop.hbase.HRegionInfo) @bci=3, line=566 (Interpreted frame)
       - org.apache.hadoop.hbase.master.RegionStates.serverOffline(org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher, org.apache.hadoop.hbase.ServerName) @bci=494, line=667 (Interpreted frame)
       - org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(org.apache.hadoop.hbase.ServerName) @bci=101, line=3334 (Interpreted frame)
       - org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process() @bci=626, line=237 (Interpreted frame)
       - org.apache.hadoop.hbase.executor.EventHandler.run() @bci=33, line=128 (Interpreted frame)
       - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
       - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line
      

      The blocked meta assignment thread looks like:

      Thread 18357: (state = BLOCKED)
       - org.apache.hadoop.hbase.master.RegionStates.getRegionState(java.lang.String) @bci=0, line=1053 (Compiled frame)
       - org.apache.hadoop.hbase.master.RegionStates.getRegionState(org.apache.hadoop.hbase.HRegionInfo) @bci=5, line=1036 (Compiled frame)
       - org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(org.apache.hadoop.hbase.HRegionInfo, boolean) @bci=5, line=1915 (Interpreted frame)
       - org.apache.hadoop.hbase.master.AssignmentManager.assign(org.apache.hadoop.hbase.HRegionInfo, boolean, boolean) @bci=29, line=1564 (Interpreted frame)
       - org.apache.hadoop.hbase.master.AssignmentManager.assign(org.apache.hadoop.hbase.HRegionInfo, boolean) @bci=4, line=1550 (Interpreted frame)
       - org.apache.hadoop.hbase.master.AssignmentManager.assignMeta(org.apache.hadoop.hbase.HRegionInfo) @bci=23, line=2636 (Interpreted frame)
       - org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMeta() @bci=64, line=159 (Interpreted frame)
       - org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.verifyAndAssignMetaWithRetries() @bci=39, line=184 (Interpreted frame)
       - org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process() @bci=276, line=93 (Interpreted frame)
       - org.apache.hadoop.hbase.executor.EventHandler.run() @bci=33, line=128 (Interpreted frame)
       - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
       - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
       - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
      

      Attachments

        1. HBASE-13083-branch-1.patch
          2 kB
          Andrey Stepachev
        2. HBASE-13083.patch
          2 kB
          Andrey Stepachev

            People

              octo47 Andrey Stepachev