HBASE-24585

Fails to start when recovering from a crash in standalone mode if procedure-based distributed WAL split is enabled & hbase.wal.split.to.hfile=true


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem

    Description

      (This description was redone after I figured out what was going on. Previously it was just a litany of me banging around trying to learn procedure-based WAL splitting and hbase.wal.split.to.hfile; no one needs to read that, hence the refactor.)

      HBASE-24574 procedure-based distributed WAL splitting is enabled, and split-to-hfile too. A forced crash requires recovery, with ServerCrashProcedure splitting the old WALs on restart. The recovery fails because we get stuck: the Master can't assign meta because meta is being recovered, and the recovery can't make progress because it is asking for a table descriptor for meta (needed by the hbase.wal.split.to.hfile feature) while the Master is not yet initialized. After the default timeout, the Master shuts down because it can't initialize.

       2020-06-18 19:53:54,175 ERROR [main] master.HMasterCommandLine: Master exiting
       java.lang.RuntimeException: Master not initialized after 200000ms
         at org.apache.hadoop.hbase.util.JVMClusterUtil.waitForEvent(JVMClusterUtil.java:232)
         at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:200)
         at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:430)
         at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:232)
         at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
         at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3059)
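
      The circular wait described above can be sketched as a toy model. This is not HBase code: the class name, the method, and the three futures are hypothetical stand-ins for "Master initialized", "meta WAL split done", and "meta assigned", and the timeout is shortened from the 200000ms seen in the log so the sketch runs quickly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of the startup cycle; hypothetical names, not HBase classes. */
public class StartupCycleSketch {

    /**
     * Returns true if the simulated Master initializes within timeoutMs.
     * splitNeedsDescriptor stands in for hbase.wal.split.to.hfile=true.
     */
    public static boolean masterInitializes(boolean splitNeedsDescriptor, long timeoutMs) {
        // Completed only at the very end of (simulated) Master initialization.
        CompletableFuture<Void> masterInitialized = new CompletableFuture<>();

        // Assumption of this model: with split-to-hfile, the meta WAL split needs
        // the meta table descriptor, which only an initialized Master can serve.
        CompletableFuture<Void> metaWalSplit = splitNeedsDescriptor
                ? masterInitialized.thenRun(() -> { })
                : CompletableFuture.<Void>completedFuture(null);

        // Meta can be assigned only after its WAL has been split.
        CompletableFuture<Void> metaAssigned = metaWalSplit.thenRun(() -> { });

        // The Master finishes initializing only after meta is assigned.
        metaAssigned.thenRun(() -> masterInitialized.complete(null));

        try {
            masterInitialized.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Analogous to "Master not initialized after ...ms".
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("split-to-hfile on:  initialized=" + masterInitializes(true, 200));
        System.out.println("split-to-hfile off: initialized=" + masterInitializes(false, 200));
    }
}
```

      With the descriptor dependency in place, the three futures wait on each other and only the timeout breaks the cycle; without it, the model initializes immediately.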
      

      The Master abort interrupts other ongoing actions, so later in the log the WAL split shows as interrupted:

       2020-06-17 21:20:37,472 ERROR [RS_LOG_REPLAY_OPS-regionserver/localhost:16020-0] handler.RSProcedureHandler: Error when call RSProcedureCallable:
       java.io.IOException: Failed WAL split, status=RESIGNED, wal=file:/Users/stack/checkouts/hbase.apache.git/tmp/hbase/WALs/localhost,16020,1592440848604-splitting/localhost%2C16020%2C1592440848604.meta.1592440852959.meta
         at org.apache.hadoop.hbase.regionserver.SplitWALCallable.splitWal(SplitWALCallable.java:106)
         at org.apache.hadoop.hbase.regionserver.SplitWALCallable.call(SplitWALCallable.java:86)
         at org.apache.hadoop.hbase.regionserver.SplitWALCallable.call(SplitWALCallable.java:49)
         at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         at java.lang.Thread.run(Thread.java:748)
      

      This issue becomes how to make hbase.wal.split.to.hfile work in standalone mode.

            People

              Assignee: Unassigned
              Reporter: Michael Stack (stack)
              Votes: 0
              Watchers: 6