HBASE-18167

OfflineMetaRepair tool may cause HMaster abort always

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.3.1, 1.3.2
    • Fix Version/s: 1.4.0, 1.3.2
    • Component/s: master
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      In a production environment, we hit a scenario where some HFile blocks of the meta table were missing for an unknown reason.
      To recover, we rebuilt meta with the OfflineMetaRepair tool and restarted the cluster, but HMaster could not finish its initialization. It always timed out because the namespace table region was never assigned.

      Steps to reproduce
      ==================
      1. Assign meta table region to HMaster (it can be on any RS, just to reproduce the scenario)

        <property>
          <name>hbase.balancer.tablesOnMaster</name>
          <value>hbase:meta</value>
        </property>
      

      2. Start HMaster and RegionServer
      3. Create two namespaces, say 'ns1' and 'ns2'
      4. Create two tables, 'ns1:t1' and 'ns2:t1'
      5. flush 'hbase:meta'
      6. Stop HMaster (graceful shutdown)
      7. Kill -9 RegionServer (abnormal shutdown)
      8. Run OfflineMetaRepair as follows:

      	hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair -fix
      

      9. Restart HMaster and RegionServer
      10. HMaster never finishes its initialization and always aborts with the message below:

      2017-06-06 15:11:07,582 FATAL [Hostname:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
      java.io.IOException: Timedout 120000ms waiting for namespace table to be assigned
              at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
              at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1054)
              at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:848)
              at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
              at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
              at java.lang.Thread.run(Thread.java:745)
      

      Root cause
      ==========
      1. During HMaster (HM) startup, the AssignmentManager (AM) assumes a failover scenario because old WAL files exist, so ServerShutdownHandler/ServerCrashProcedure (SSH/SCP) splits the WAL files and assigns the regions the dead server was holding.
      2. SSH/SCP retrieves the regions held by that server from meta or the AM's in-memory state, but meta has only the "regioninfo" entry for each region (it was just rebuilt by OfflineMetaRepair). An empty region list is therefore returned and no assignment is triggered.
      3. HMaster, which is waiting for the namespace table to be assigned, always times out and aborts.
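
      The chain above can be illustrated with a small standalone model. This is only a sketch and not HBase code: the class MetaRebuildSketch, its methods, the region names, and the server name are all hypothetical, and meta is reduced to a map with plain "regioninfo" and "server" columns. It shows that a lookup keyed on the server column finds nothing in a rebuilt meta, so a wait for the namespace region can only time out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the failure; none of these types or names are HBase classes. */
class MetaRebuildSketch {
    // Hypothetical region name used only in this sketch.
    static final String NAMESPACE_REGION = "hbase:namespace,,1";

    /** Meta as rebuilt offline: every row has "regioninfo" but no "server" column. */
    static Map<String, Map<String, String>> rebuiltMeta() {
        Map<String, Map<String, String>> meta = new HashMap<>();
        for (String region : new String[] {NAMESPACE_REGION, "ns1:t1,,1", "ns2:t1,,1"}) {
            Map<String, String> columns = new HashMap<>();
            columns.put("regioninfo", region);
            meta.put(region, columns);
        }
        return meta;
    }

    /** Regions whose "server" column names the dead server, i.e. what crash handling would reassign. */
    static List<String> regionsOnServer(Map<String, Map<String, String>> meta, String deadServer) {
        List<String> regions = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : meta.entrySet()) {
            // A missing "server" column yields null, which never equals deadServer.
            if (deadServer.equals(row.getValue().get("server"))) {
                regions.add(row.getKey());
            }
        }
        return regions;
    }

    /** Polls until the namespace region is assigned; returns false on timeout or interrupt. */
    static boolean waitForNamespace(Set<String> assigned, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!assigned.contains(NAMESPACE_REGION)) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> meta = rebuiltMeta();
        // Crash handling for the killed RegionServer finds no regions to reassign.
        List<String> toAssign = regionsOnServer(meta, "rs1,16020,1");
        Set<String> assigned = new HashSet<>(toAssign);
        System.out.println("regions to reassign: " + toAssign.size());
        // Short timeout here; the log above shows 120000 ms in the real master.
        System.out.println("namespace assigned: " + waitForNamespace(assigned, 100));
    }
}
```

      In this model, adding a "server" value to a row is enough for regionsOnServer to return it, which is why the missing server information in the rebuilt meta is the decisive factor.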

      Attachments

        1. HBASE-18167-branch-1.patch
          11 kB
          Pankaj Kumar
        2. HBASE-18167-branch-1-V2.patch
          11 kB
          Pankaj Kumar
        3. HBASE-18167-branch-1.3.v2.patch
          11 kB
          Pankaj Kumar

          People

            Assignee: pankaj2461 Pankaj Kumar
            Reporter: pankaj2461 Pankaj Kumar
            Votes: 0
            Watchers: 9
