HBASE-11460

Deadlock in HMaster on masterAndZKLock in HConnectionManager

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.96.0
    • Fix Version/s: 0.99.0, 0.98.4, 2.0.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      On one of our clusters we got a deadlock in HMaster.
      In a nutshell, the deadlock is caused by using a single HConnectionManager both for client-like calls and for calls made from HMaster RPC handlers.

      HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
      On the other side sits TableNamespaceManager (TNM). This class uses HConnectionManager too (in my case, to get the list of available namespaces).
      The problem is that HMaster uses TNM to serve RPC requests.
      If we look at TNM more closely, we can see that the class is fully synchronized.

      That gives us a problem.

      The web interface issues a request via HConnectionManager and locks masterAndZKLock.
      The connection is blocking, so RpcClient waits for the reply while holding the lock.
      This is how it looks in the thread dump:

         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
      	- locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
      	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
      	- locked <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
      

      Some other client calls an HMaster RPC that invokes TableNamespaceManager methods, which in turn block on the HConnectionManager global lock masterAndZKLock.
      This is how it looks:

        java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
      	- waiting to lock <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
      	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
      	- locked <0x00000000d1b49fd8> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
      	- locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
      	at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
      	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
      

      Finally, the original handler, which should serve the request from the web UI, can be blocked on TNM methods, effectively forming a deadlock.
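      The cycle described above can be reproduced in miniature. The sketch below is not HBase code and simplifies the three-party cycle (web UI caller, TNM monitor, RPC handler) to two parties: a plain ReentrantLock stands in for masterAndZKLock, and a single-thread executor stands in for the RPC handler that needs the same lock. All class and method names are hypothetical, and a timeout is used so the program reports the stall instead of hanging.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

final class SharedLockStallSketch {

    /**
     * Issues a blocking "call" to a handler that needs the same lock as the caller.
     * Returns true if the caller timed out waiting for the reply (the stall occurred).
     */
    static boolean stalls(boolean holdLockWhileWaiting, long waitMillis) {
        ReentrantLock sharedLock = new ReentrantLock();                 // stand-in for masterAndZKLock
        ExecutorService handler = Executors.newSingleThreadExecutor();  // stand-in for an RPC handler
        sharedLock.lock();                                              // caller takes the shared lock
        try {
            Future<String> reply = handler.submit(() -> {
                sharedLock.lockInterruptibly();                         // handler needs the same lock
                try {
                    return "reply";
                } finally {
                    sharedLock.unlock();
                }
            });
            if (!holdLockWhileWaiting) {
                sharedLock.unlock();                                    // release before blocking on the reply
            }
            try {
                reply.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;                                           // reply arrived, no stall
            } catch (TimeoutException stalled) {
                return true;                                            // handler never got the lock in time
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            if (sharedLock.isHeldByCurrentThread()) {
                sharedLock.unlock();
            }
            handler.shutdownNow();                                      // interrupt a still-blocked handler
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held while waiting, stalled: " + stalls(true, 200));
        System.out.println("lock released before waiting, stalled: " + stalls(false, 5000));
    }
}
```

      In the first call the caller never lets go of the lock, so the handler cannot make progress and the wait times out. In the real issue there is no timeout to break the cycle, which is why the master stays blocked.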

      1. 11460-v1.txt
        2 kB
        Ted Yu
      2. 11460-v1-0.98.patch
        2 kB
        Andrew Purtell
      3. threads.tdump
        137 kB
        Andrey Stepachev

          Activity

          Ted Yu added a comment -

          Created HBASE-15080 with patch for 0.98 branch.

          Josh Elser added a comment -

          Should I open a JIRA to remove the synchronized block (considering this issue is 1 year old) ?

          Ya, that's what I'd say. It's just a bug at this point, IMO

          Ted Yu added a comment -

          Josh Elser found that in 0.98, the synchronized block below should have been taken out (as was done for branch-1 +):

                synchronized (masterAndZKLock) {
                  if (keepAliveZookeeperUserCount.decrementAndGet() <= 0 ){
          

          keepAliveZookeeperUserCount is an AtomicInteger. There is no need for the synchronized block.
          Andrew Purtell:
          Should I open a JIRA to remove the synchronized block (considering this issue is 1 year old) ?
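
          To illustrate why an AtomicInteger counter can be decremented without the surrounding lock, here is a minimal sketch. It is not the HBase code: the class, its methods, and the userCount field are hypothetical stand-ins for keepAliveZookeeperUserCount, and the sketch covers only the counter itself, not whatever cleanup the real code performs once the count reaches zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class KeepAliveCounterSketch {
    // Hypothetical stand-in for keepAliveZookeeperUserCount.
    private final AtomicInteger userCount = new AtomicInteger();

    void acquire() {
        userCount.incrementAndGet();
    }

    /** Returns true when this call released the last reference. No lock is taken. */
    boolean release() {
        return userCount.decrementAndGet() <= 0;
    }

    /** Acquires `users` references, releases each from its own thread, and counts "last user" signals. */
    static int countLastUserSignals(int users) {
        KeepAliveCounterSketch counter = new KeepAliveCounterSketch();
        for (int i = 0; i < users; i++) {
            counter.acquire();
        }
        AtomicInteger signals = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < users; i++) {
            Thread t = new Thread(() -> {
                if (counter.release()) {
                    signals.incrementAndGet();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return signals.get();
    }

    public static void main(String[] args) {
        System.out.println("last-user signals: " + countLastUserSignals(8));
    }
}
```

          Because decrementAndGet() is atomic, each releasing thread observes a distinct value, so exactly one of them sees zero even though no thread holds a monitor.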

          Enis Soztutar added a comment -

          Closing this issue after 0.99.0 release.

          Hudson added a comment -

          SUCCESS: Integrated in HBase-0.98 #388 (See https://builds.apache.org/job/HBase-0.98/388/)
          HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (Ted Yu) (apurtell: rev f60e0bd8f3e38c58b95ab0c746a66c595f234653)

          • hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
          Hudson added a comment -

          SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #367 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/367/)
          HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (Ted Yu) (apurtell: rev f60e0bd8f3e38c58b95ab0c746a66c595f234653)

          • hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
          Hudson added a comment -

          SUCCESS: Integrated in HBase-1.0 #29 (See https://builds.apache.org/job/HBase-1.0/29/)
          HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (tedyu: rev 94e29bd0a503136b20be3984f6bbdad46b52113a)

          • hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
          Andrew Purtell added a comment - - edited

          Committed to 0.98, thanks Ted.
          Edit: 0.98

          Ted Yu added a comment -

          lgtm

          Andrew Purtell added a comment -

          Patch for 0.98 including Enis' feedback. Please review and I will commit.

          Hudson added a comment -

          SUCCESS: Integrated in HBase-TRUNK #5288 (See https://builds.apache.org/job/HBase-TRUNK/5288/)
          HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (tedyu: rev b3da98a1a28093ec2f0fe0af39e06be636604a5b)

          • hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
          Ted Yu added a comment -

          ping Andrew Purtell

          Enis Soztutar added a comment -

          +1 for branch-1. We should get this in 0.98 as well I think.
          Ted, can you change addAndGet() calls to increment/decrementAndGet() for readability at commit time. Thanks.
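
          The readability change requested here does not alter behavior: on AtomicInteger, incrementAndGet() and decrementAndGet() are equivalent to addAndGet(1) and addAndGet(-1). A small illustration follows; the class and method names are made up for the example and are not taken from the patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AtomicCallStyles {

    /** Counts one user in and out using explicit deltas. */
    static int withAddAndGet(AtomicInteger count) {
        count.addAndGet(1);
        return count.addAndGet(-1);
    }

    /** The same two operations written with the dedicated methods. */
    static int withIncrementDecrement(AtomicInteger count) {
        count.incrementAndGet();
        return count.decrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(withAddAndGet(new AtomicInteger(3)));
        System.out.println(withIncrementDecrement(new AtomicInteger(3)));
    }
}
```

          Both methods leave the counter at its starting value and return it, so the choice between them is purely about how clearly the intent reads.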

          rajeshbabu added a comment -

          +1

          chunhui shen added a comment -

          lgtm
          +1 on patch

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12653968/11460-v1.txt
          against trunk revision .
          ATTACHMENT ID: 12653968

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//console

          This message is automatically generated.

          Andrey Stepachev added a comment -

          Thank you Ted, looks like it fixes issue. Great work.

          Ted Yu added a comment -

          Thanks for reporting this, Andrey.

          Here is a tentative patch that changes HConnectionImplementation#keepAliveZookeeperUserCount to AtomicInteger.

          Andrey Stepachev added a comment -

          thread dump attached


            People

            • Assignee: Ted Yu
            • Reporter: Andrey Stepachev
            • Votes: 0
            • Watchers: 12
