HBASE-11460

Deadlock in HMaster on masterAndZKLock in HConnectionManager

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.96.0
    • Fix Version/s: 0.99.0, 0.98.4, 2.0.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      On one of our clusters we hit a deadlock in HMaster.
      In a nutshell, the deadlock is caused by using a single HConnectionManager connection both for client-like calls and for calls made from HMaster RPC handlers.

      HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
      On the other side sits TableNamespaceManager (TNM). This class uses HConnectionManager too (in my case, to get the list of available namespaces).
      The problem is that HMaster uses TNM to serve RPC requests.
      If we look at TNM more closely, we can see that the class is fully synchronized.

      That gives us a problem.

      The web interface issues a request via HConnectionManager and locks masterAndZKLock.
      The connection is blocking, so RpcClient waits for the reply while still holding the lock.
      This is how it looks in the thread dump:

         java.lang.Thread.State: TIMED_WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
      	- locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
      	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
      	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
      	- locked <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
      	- locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
      	at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
      

      Some other client calls an HMaster RPC, which calls TableNamespaceManager methods, which in turn block on the HConnectionManager global lock masterAndZKLock.
      This is how it looks:

        java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
      	- waiting to lock <0x00000000d15dc668> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
      	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
      	- locked <0x00000000d1b49fd8> (a java.lang.Object)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
      	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
      	at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
      	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
      	- locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
      	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
      	at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
      	- locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
      	at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
      	at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
      	at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
      	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
      	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
      

      Finally, the original handler, which should serve the request from the web UI, blocks on the TNM methods, completing the deadlock.
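
      The lock cycle described above can be sketched in isolation. This is a simplified illustration, not HBase code: the class and method names are hypothetical, connectionLock stands in for masterAndZKLock, namespaceLock stands in for the TableNamespaceManager monitor, and the web UI caller and the handler serving it are collapsed into one thread. tryLock with a timeout is used instead of a plain lock so that the sketch terminates rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the lock cycle; none of these names are HBase APIs. */
class LockCycleSketch {

    /** Returns true when each thread holds one lock and cannot obtain the other. */
    static boolean bothThreadsBlocked() {
        ReentrantLock connectionLock = new ReentrantLock(); // plays the role of masterAndZKLock
        ReentrantLock namespaceLock = new ReentrantLock();  // plays the role of the TNM monitor
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] blocked = new boolean[2];

        // "Web UI" side: takes the connection lock, then needs the namespace lock.
        Thread webUi = new Thread(() ->
            blocked[0] = holdThenTry(connectionLock, namespaceLock, bothHoldFirst, bothTried));
        // "RPC handler" side: takes the namespace lock, then needs the connection lock.
        Thread rpcHandler = new Thread(() ->
            blocked[1] = holdThenTry(namespaceLock, connectionLock, bothHoldFirst, bothTried));

        webUi.start();
        rpcHandler.start();
        try {
            webUi.join();
            rpcHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return blocked[0] && blocked[1];
    }

    private static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                                       CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A bounded wait replaces the unbounded wait seen in the thread dumps.
            boolean acquired = second.tryLock(100, TimeUnit.MILLISECONDS);
            // Keep holding the first lock until both threads have finished trying.
            bothTried.countDown();
            bothTried.await();
            if (acquired) {
                second.unlock();
            }
            return !acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }
}
```

      With plain blocking acquisition in place of tryLock, neither thread would ever proceed, which matches the BLOCKED and TIMED_WAITING states in the dumps.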

      1. 11460-v1.txt
        2 kB
        Ted Yu
      2. 11460-v1-0.98.patch
        2 kB
        Andrew Purtell
      3. threads.tdump
        137 kB
        Andrey Stepachev

        Activity

        Andrey Stepachev added a comment -

        thread dump attached

        Ted Yu added a comment -

        Thanks for reporting this, Andrey.

        Here is a tentative patch that changes HConnectionImplementation#keepAliveZookeeperUserCount to AtomicInteger.
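
        As a minimal sketch of the general idea (not the attached patch): a user count held in an AtomicInteger can be updated from many threads without holding any monitor, so counting users no longer requires taking a shared lock. The class and method names below are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical lock-free user counter; not the HBase implementation. */
class KeepAliveUserCountSketch {
    private final AtomicInteger userCount = new AtomicInteger(0);

    /** Registers one more user and returns the new count. */
    int retain() {
        return userCount.addAndGet(1);
    }

    /** Unregisters one user and returns the new count. */
    int release() {
        return userCount.addAndGet(-1);
    }

    int current() {
        return userCount.get();
    }

    /** Increments from several threads at once and returns the final count. */
    static int hammer(int threads, int perThread) {
        KeepAliveUserCountSketch counter = new KeepAliveUserCountSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.retain();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.current();
    }
}
```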

        Andrey Stepachev added a comment -

        Thank you Ted, looks like it fixes issue. Great work.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12653968/11460-v1.txt
        against trunk revision .
        ATTACHMENT ID: 12653968

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9970//console

        This message is automatically generated.

        chunhui shen added a comment -

        lgtm
        +1 on patch

        rajeshbabu added a comment -

        +1

        Enis Soztutar added a comment -

        +1 for branch-1. We should get this in 0.98 as well I think.
        Ted, can you change addAndGet() calls to increment/decrementAndGet() for readability at commit time. Thanks.
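
        The readability point can be illustrated with a small sketch: on java.util.concurrent.atomic.AtomicInteger, addAndGet(1) and incrementAndGet() have the same effect, as do addAndGet(-1) and decrementAndGet(). The class name below is hypothetical and the code is not taken from the patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical comparison of the two call styles on AtomicInteger. */
class IncrementStyleSketch {

    /** Applies +1, +1, -1 in both styles and returns {viaAdd, viaIncrement}. */
    static int[] compare() {
        AtomicInteger viaAdd = new AtomicInteger();
        viaAdd.addAndGet(1);
        viaAdd.addAndGet(1);
        viaAdd.addAndGet(-1);

        AtomicInteger viaIncrement = new AtomicInteger();
        viaIncrement.incrementAndGet();
        viaIncrement.incrementAndGet();
        viaIncrement.decrementAndGet();

        return new int[] { viaAdd.get(), viaIncrement.get() };
    }
}
```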

        Ted Yu added a comment -

        ping Andrew Purtell

        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #5288 (See https://builds.apache.org/job/HBase-TRUNK/5288/)
        HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (tedyu: rev b3da98a1a28093ec2f0fe0af39e06be636604a5b)

        • hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
        Andrew Purtell added a comment -

        Patch for 0.98 including Enis' feedback. Please review and I will commit.

        Ted Yu added a comment -

        lgtm

        Andrew Purtell added a comment - - edited

        Committed to 0.98, thanks Ted.
        Edit: 0.98

        Hudson added a comment -

        SUCCESS: Integrated in HBase-1.0 #29 (See https://builds.apache.org/job/HBase-1.0/29/)
        HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (tedyu: rev 94e29bd0a503136b20be3984f6bbdad46b52113a)

        • hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionManager.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #367 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/367/)
        HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (Ted Yu) (apurtell: rev f60e0bd8f3e38c58b95ab0c746a66c595f234653)

        • hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.98 #388 (See https://builds.apache.org/job/HBase-0.98/388/)
        HBASE-11460 Deadlock in HMaster on masterAndZKLock in HConnectionManager (Ted Yu) (apurtell: rev f60e0bd8f3e38c58b95ab0c746a66c595f234653)

        • hbase-client/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
        Enis Soztutar added a comment -

        Closing this issue after 0.99.0 release.


          People

          • Assignee:
            Ted Yu
            Reporter:
            Andrey Stepachev
          • Votes:
            0
            Watchers:
            11