HBase / HBASE-10272

Cluster becomes nonoperational if the node hosting the active Master AND ROOT/META table goes offline

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.96.1, 0.94.15
    • Fix Version/s: 0.98.0, 0.94.16, 0.96.2, 0.99.0
    • Component/s: IPC/RPC
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Since HBASE-6364, the HBase client caches a connection failure to a server, and any subsequent attempt to connect to that server throws a FailedServerException.

      Now, if a node that hosted both the active Master AND the ROOT/META table goes offline, the newly anointed Master's initial attempt to connect to the dead region server fails with a NoRouteToHostException, which it handles. The second attempt, however, fails with a FailedServerException, which is not handled, and the Master crashes.
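
The sequence above can be reproduced in miniature. The following is only a sketch of the caching behaviour, not HBase code: `FailedServerException` here is a local stand-in for `HBaseClient.FailedServerException`, `CachingClient` and `FailedServerDemo` are hypothetical names, and the 2000 ms expiry window is an assumed value chosen for illustration.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HBaseClient.FailedServerException.
class FailedServerException extends IOException {
    FailedServerException(String message) { super(message); }
}

// Toy client that remembers a failed server for a fixed window (assumed behaviour).
class CachingClient {
    private final Map<String, Long> failedUntil = new HashMap<>();
    private final long expiryMillis;

    CachingClient(long expiryMillis) { this.expiryMillis = expiryMillis; }

    // Simulates a connection attempt at time nowMillis to a host that is down.
    void connect(String address, long nowMillis) throws IOException {
        Long until = failedUntil.get(address);
        if (until != null && nowMillis < until) {
            // Cached failure: fail fast without touching the network.
            throw new FailedServerException(
                "This server is in the failed servers list: " + address);
        }
        // "Real" attempt: the host is unreachable, so record the failure and throw.
        failedUntil.put(address, nowMillis + expiryMillis);
        throw new NoRouteToHostException("No route to host: " + address);
    }
}

class FailedServerDemo {
    // Returns the exception class name seen on each attempt.
    static String[] attempt(CachingClient client, String address, long... times) {
        String[] seen = new String[times.length];
        for (int i = 0; i < times.length; i++) {
            try {
                client.connect(address, times[i]);
                seen[i] = "connected";
            } catch (IOException e) {
                seen[i] = e.getClass().getSimpleName();
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = attempt(new CachingClient(2000),
                                "xxx02/192.168.1.102:60020", 0, 100, 5000);
        System.out.println(String.join(", ", seen));
    }
}
```

A caller that handles only the first exception type survives attempt one and is caught out by attempt two, which is the pattern in the stack trace.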

      Here is the log from one such occurrence:

      2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
      2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
              at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
              at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
              at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
              at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
              at $Proxy9.getProtocolVersion(Unknown Source)
              at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
              at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
              at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
              at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
              at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
              at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
              at java.lang.Thread.run(Thread.java:662)
      2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
      2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
      

      Each of the backup Masters will crash with the same error, and restarting them has the same effect. Once this happens, the cluster remains nonoperational until the node hosting the region server is brought back online (or the ZooKeeper node containing the root region server, and/or the META entry in the ROOT table, is deleted).
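
One plausible way to avoid the crash is for the caller to treat the cached failure like any other "server unreachable" condition, so that it can go on to reassign the catalog regions instead of aborting. The sketch below illustrates that idea only; it is not the attached patch. `CachedFailureException` is a local stand-in for `HBaseClient.FailedServerException`, and `Connector` and `CatalogProbe` are hypothetical names.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

// Hypothetical stand-in for HBaseClient.FailedServerException.
class CachedFailureException extends IOException {
    CachedFailureException(String message) { super(message); }
}

// Hypothetical abstraction over "open a connection to a region server".
interface Connector {
    Object connect(String address) throws IOException;
}

class CatalogProbe {
    // Returns the connection, or null if the server is known to be unreachable.
    // Any other IOException is rethrown to the caller.
    static Object getConnectionOrNull(Connector connector, String address)
            throws IOException {
        try {
            return connector.connect(address);
        } catch (NoRouteToHostException | ConnectException
                 | SocketTimeoutException | CachedFailureException e) {
            // The server is down or was recently seen down: report "no connection"
            // so the caller can reassign the region rather than abort.
            return null;
        }
    }
}
```

Returning null for the cached failure keeps both attempts on the same code path, so the outcome no longer depends on whether the failure cache has been populated yet.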

      Attachments

        1. HBASE-10272.patch
          2 kB
          Aditya Kishore
        2. HBASE-10272_0.94.patch
          2 kB
          Aditya Kishore

          People

            Assignee: Aditya Kishore (adityakishore)
            Reporter: Aditya Kishore (adityakishore)
            Votes: 0
            Watchers: 7