HBASE-10272

Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.96.1, 0.94.15
    • Fix Version/s: 0.98.0, 0.94.16, 0.96.2, 0.99.0
    • Component/s: IPC/RPC
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Since HBASE-6364, the HBase client caches a connection failure to a server, and any subsequent attempt to connect to that server throws a FailedServerException.

      Now, if the node hosting both the active Master and the ROOT/META table goes offline, the newly anointed Master's first attempt to connect to the dead region server fails with NoRouteToHostException, which it handles. Its second attempt, however, fails with FailedServerException, which it does not handle, so the Master crashes.
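The two-step failure described above can be illustrated with a minimal, self-contained sketch. It is not HBase code: `FailedServerCacheSketch`, `FailedServers`, `Client`, and the 2000 ms expiry window are hypothetical stand-ins chosen for illustration, and the sketch only assumes that a cached failure expires after some fixed period.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; these classes are hypothetical stand-ins, not HBase code.
class FailedServerCacheSketch {

    /** Stand-in for the client's "server is in the failed servers list" exception. */
    static class FailedServerException extends IOException {
        FailedServerException(String message) { super(message); }
    }

    /** Remembers addresses that recently failed to connect, for a fixed window (assumed). */
    static class FailedServers {
        private final Map<String, Long> expiryByAddress = new HashMap<>();
        private final long windowMillis;

        FailedServers(long windowMillis) { this.windowMillis = windowMillis; }

        void add(String address, long now) {
            expiryByAddress.put(address, now + windowMillis);
        }

        boolean isFailed(String address, long now) {
            Long expiry = expiryByAddress.get(address);
            if (expiry == null) return false;
            if (expiry <= now) {            // entry is stale: forget it
                expiryByAddress.remove(address);
                return false;
            }
            return true;
        }
    }

    /** Client whose target host is permanently unreachable. */
    static class Client {
        private final FailedServers failed = new FailedServers(2000);

        void connect(String address, long now) throws IOException {
            if (failed.isFailed(address, now)) {
                throw new FailedServerException(
                    "This server is in the failed servers list: " + address);
            }
            // Simulate the dead node: a real socket connect would fail here.
            failed.add(address, now);
            throw new NoRouteToHostException("No route to host: " + address);
        }
    }

    /** Returns the simple class name of the exception thrown by one connect attempt. */
    static String attempt(Client client, String address, long now) {
        try {
            client.connect(address, now);
            return "connected";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Client client = new Client();
        String server = "xxx02/192.168.1.102:60020";
        System.out.println(attempt(client, server, 0));    // first attempt
        System.out.println(attempt(client, server, 100));  // second attempt, inside the window
        System.out.println(attempt(client, server, 5000)); // after the window has expired
    }
}
```

In this sketch the first attempt surfaces the underlying network error, while the second surfaces a different exception type for the same dead host. A caller that only anticipates the first type will treat the second as unexpected.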

      Here is the log from one such occurrence:

      2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
      2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
              at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
              at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
              at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
              at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
              at $Proxy9.getProtocolVersion(Unknown Source)
              at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
              at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
              at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
              at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
              at java.lang.Thread.run(Thread.java:662)
      2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
      2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
      

      Each backup master will crash with the same error, and restarting them has the same effect. Once this happens, the cluster remains inoperable until the node with the region server is brought back online (or the ZooKeeper node containing the root region server and/or the META entry in the ROOT table is deleted).
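One way to avoid this kind of crash loop, sketched below under stated assumptions, is for the caller to treat a cached failure the same way as a fresh connectivity failure and report "no connection" instead of propagating the exception. This is not the attached patch, whose contents are not reproduced here; `TolerantLookupSketch`, `Connector`, and `connectOrNull` are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

// Illustrative sketch only; every name below is a hypothetical stand-in, not HBase code.
class TolerantLookupSketch {

    /** Stand-in for the client's "server is in the failed servers list" exception. */
    static class FailedServerException extends IOException {
        FailedServerException(String message) { super(message); }
    }

    /** Hypothetical connection factory; returns an opaque handle on success. */
    interface Connector {
        String connect(String address) throws IOException;
    }

    /**
     * Returns a connection handle, or null when the server appears unreachable.
     * A cached failure is handled like a fresh connectivity failure, so the
     * caller can move on (for example, to reassignment) instead of aborting.
     * Any other IOException still propagates.
     */
    static String connectOrNull(Connector connector, String address) throws IOException {
        try {
            return connector.connect(address);
        } catch (NoRouteToHostException | ConnectException
                | SocketTimeoutException | FailedServerException e) {
            return null; // unreachable: report "no connection" rather than propagate
        }
    }

    public static void main(String[] args) throws IOException {
        Connector cachedFailure = address -> {
            throw new FailedServerException(
                "This server is in the failed servers list: " + address);
        };
        String handle = connectOrNull(cachedFailure, "xxx02/192.168.1.102:60020");
        System.out.println(handle == null ? "server unavailable" : handle);
    }
}
```

The design choice in this sketch is to keep the list of tolerated exceptions explicit, so that unrelated I/O errors are still surfaced to the caller.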

      1. HBASE-10272.patch
        2 kB
        Aditya Kishore
      2. HBASE-10272_0.94.patch
        2 kB
        Aditya Kishore

        Activity

        Aditya Kishore added a comment -

        Patch for 0.94 branch.

        ramkrishna.s.vasudevan added a comment -

        +1 on patch. Is it possible to write a test case, something like in TestMasterFailOver?

        Aditya Kishore added a comment -

        I couldn't find a way to simulate the entire host going offline at once. All the kill() and abort() methods close the regions, which cleans up the information in ZK that leads to this situation.

        Aditya Kishore added a comment -

        Patch for trunk

        Aditya Kishore added a comment -

        Submitting to HadoopQA.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12621405/HBASE-10272.patch
        against trunk revision .
        ATTACHMENT ID: 12621405

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        -1 release audit. The applied patch generated 4 release audit warnings (more than the trunk's current 0 warnings).

        +1 lineLengths. The patch does not introduce lines longer than 100

        -1 site. The patch appears to cause mvn site goal to fail.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8336//console

        This message is automatically generated.

        Ted Yu added a comment -

        +1

        Andrew Purtell:
        Do you want this in 0.98?

        Andrew Purtell added a comment -

        +1

        Ted Yu added a comment -

        Integrated to 0.98 and trunk.

        Thanks for the reviews.

        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #4787 (See https://builds.apache.org/job/HBase-TRUNK/4787/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Tedyu: rev 1555312)

        • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.98 #56 (See https://builds.apache.org/job/HBase-0.98/56/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Tedyu: rev 1555313)

        • /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.98-on-Hadoop-1.1 #52 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/52/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Tedyu: rev 1555313)

        • /hbase/branches/0.98/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-1.1 #41 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/41/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Tedyu: rev 1555312)

        • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Aditya Kishore added a comment -

        Lars Hofhansl, this should go into the next 0.94 release.

        Lars Hofhansl added a comment -

        +1 for 0.94. stack, I assume you want this in 0.96.

        Lars Hofhansl added a comment -

        Committed to 0.94.

        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94-security #379 (See https://builds.apache.org/job/HBase-0.94-security/379/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Aditya Kishore) (larsh: rev 1555960)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Lars Hofhansl added a comment -

        Assuming you want this, stack, I took the liberty of committing it to 0.96 as well.

        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94 #1252 (See https://builds.apache.org/job/HBase-0.94/1252/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Aditya Kishore) (larsh: rev 1555960)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94-JDK7 #19 (See https://builds.apache.org/job/HBase-0.94-JDK7/19/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Aditya Kishore) (larsh: rev 1555960)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.96 #251 (See https://builds.apache.org/job/hbase-0.96/251/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Aditya Kishore) (larsh: rev 1556015)

        • /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.96-hadoop2 #170 (See https://builds.apache.org/job/hbase-0.96-hadoop2/170/)
        HBASE-10272 Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline (Aditya Kishore) (larsh: rev 1556015)

        • /hbase/branches/0.96/hbase-client/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java

          People

          • Assignee:
            Aditya Kishore
            Reporter:
            Aditya Kishore
          • Votes:
            0
            Watchers:
            8
