HBASE-11460

Deadlock in HMaster on masterAndZKLock in HConnectionManager

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.96.0
    • Fix Version/s: 0.99.0, 0.98.4
    • Component/s: master
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      On one of our clusters we hit a deadlock in HMaster.
      In a nutshell, the deadlock is caused by using a single HConnectionManager connection both for client-like calls and for calls made from HMaster RPC handlers.

      HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
      On the other side sits TableNamespaceManager (TNM). This class uses HConnectionManager too (in my case, to get the list of available namespaces).
      The problem is that HMaster uses TNM while serving RPC requests.
      If we look at TNM more closely, we can see that its methods are synchronized on the TNM instance, as the thread dump below confirms for TableNamespaceManager.get.

      That gives us a problem.

      The web interface issues a request via HConnectionManager and locks masterAndZKLock.
      The connection is blocking, so RpcClient waits for the reply while still holding the lock.
      This is how it looks in the thread dump:

         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
      	- locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
      	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
      	- locked <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
      

      Some other client calls an HMaster RPC, which calls TableNamespaceManager methods, which in turn block on the global HConnectionManager lock masterAndZKLock.
      This is how it looks:

        java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
      	- waiting to lock <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
      	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
      	- locked <0x00000000d1b49fd8> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
      	- locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
      	at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
      	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
      

      Finally, the original handler, which should serve the request from the web UI, can be blocked on the TNM methods, which completes the deadlock.
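
      The wait cycle described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not HBase code: the two plain Object monitors stand in for masterAndZKLock and the TableNamespaceManager instance, a two-thread executor stands in for the HMaster RPC handlers, and the class and method names are hypothetical. Unlike the real incident, the simulated web UI call gives up after a timeout so that the program terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class MasterLockCycleSketch {
    // Hypothetical stand-ins for the two monitors named in the report.
    static final Object masterAndZKLock = new Object();
    static final Object tnmMonitor = new Object();

    /** Returns true if the simulated web UI call timed out, i.e. the wait cycle formed. */
    static boolean runScenario() throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(2); // simulated RPC handlers
        CountDownLatch tnmHeld = new CountDownLatch(1);
        CountDownLatch connectionLockHeld = new CountDownLatch(1);

        // Handler B: inside a synchronized TNM method, then needs the connection lock.
        handlers.submit(() -> {
            synchronized (tnmMonitor) {
                tnmHeld.countDown();
                connectionLockHeld.await();        // let the web UI thread take the lock first
                synchronized (masterAndZKLock) {   // blocks while the web UI thread holds it
                    return "namespace";
                }
            }
        });

        tnmHeld.await();
        boolean timedOut;
        // Web UI thread: takes the connection lock, then makes a blocking "RPC".
        synchronized (masterAndZKLock) {
            connectionLockHeld.countDown();
            // Handler H serves the RPC but needs the TNM monitor held by handler B.
            Future<String> reply = handlers.submit(() -> {
                synchronized (tnmMonitor) {
                    return "master is running";
                }
            });
            try {
                reply.get(500, TimeUnit.MILLISECONDS); // waits while holding the lock
                timedOut = false;
            } catch (TimeoutException e) {
                timedOut = true;                       // the cycle prevented any reply
            }
        }
        // Leaving the block releases the lock, so B and then H can finish.
        handlers.shutdown();
        handlers.awaitTermination(5, TimeUnit.SECONDS);
        return timedOut;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("simulated RPC timed out: " + runScenario());
    }
}
```

      Note that only one thread in this cycle is BLOCKED on a monitor held by another BLOCKED thread; the web UI thread is in a timed wait for a reply, which matches the TIMED_WAITING state in the first thread dump.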

      Attachments

        1. 11460-v1-0.98.patch
          2 kB
          Andrew Kyle Purtell
        2. 11460-v1.txt
          2 kB
          Ted Yu
        3. threads.tdump
          137 kB
          Andrey Stepachev

            People

              Assignee: Ted Yu (yuzhihong@gmail.com)
              Reporter: Andrey Stepachev (octo47)