HBase / HBASE-27382

Cluster completely down due to WAL splitting failing for hbase:meta table.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.5.0, 1.7.2, 2.4.14

    Description

      We are running a build based on 1.7.2 in our production environment and encountered this issue recently.
      We colocate the namenode and the region server holding the hbase:meta table on a set of 5 master nodes. Coincidentally, the active namenode and the region server holding the meta table were on the same physical node, and that node went down due to a hardware issue. We have suboptimal HDFS-level timeouts configured, so whenever the active namenode goes down it takes around 12-15 minutes for the HDFS client within HBase to connect to the new active namenode. As a result, all the region servers had trouble connecting to the new active namenode for about 15 minutes.
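
      For reference, a minimal sketch of the HDFS client settings that typically govern how quickly a client gives up on a dead namenode and fails over, assuming a standard HA client configuration. The values below are purely illustrative and are not the ones from our deployment; misconfigured values of this kind are what stretched our failover to 12-15 minutes.

      import org.apache.hadoop.conf.Configuration;

      public class NamenodeFailoverTuning {
        /**
         * Illustrative HDFS client settings that influence how long it takes to
         * give up on a dead namenode and fail over. The values are hypothetical.
         */
        public static Configuration tunedClientConf() {
          Configuration conf = new Configuration();
          // How long a single connect attempt to a namenode may hang, and how
          // often it is retried on timeout.
          conf.setInt("ipc.client.connect.timeout", 5000);
          conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
          // How many failover attempts the HA proxy provider makes and how long
          // it backs off between them.
          conf.setInt("dfs.client.failover.max.attempts", 15);
          conf.setInt("dfs.client.failover.sleep.base.millis", 500);
          conf.setInt("dfs.client.failover.sleep.max.millis", 15000);
          return conf;
        }
      }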

      Below is the sequence of events:

      1. The host running the active namenode and hbase:meta went down at 2022-09-09 16:56:56,878
      2. HMaster started running ServerCrashProcedure at 2022-09-09 16:59:05,696

      2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] procedure2.ProcedureExecutor - Procedure ServerCrashProcedure serverName=<hmase-meta-RS>,61020,1662714013670, shouldSplitWal=true, carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to the store.
      
      2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - Added=<hmase-meta-RS>,61020,1662714013670 to dead servers, submitted shutdown handler to be executed meta=true
      
      2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - Started processing <hmase-meta-RS>,61020,1662714013670; numProcessing=1
      2022-09-09 16:59:05,712 INFO  [ProcedureExecutor-0] procedure.ServerCrashProcedure - Start processing crashed <hmase-meta-RS>,61020,1662714013670
      

      3. SplitLogManager created 2 split log tasks in ZooKeeper.

      2022-09-09 16:59:06,049 INFO  [ProcedureExecutor-1] master.SplitLogManager - Started splitting 2 logs in [hdfs://<cluster-name>/hbase/WALs/<hmase-meta-RS>,61020,1662714013670-splitting] for [<hmase-meta-RS>,61020,1662714013670]
      
      2022-09-09 16:59:06,081 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
      
      2022-09-09 16:59:06,093 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662739251611.meta
      

      4. The first split log task is the more interesting one: /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta

      5. Since all the region servers were having problems connecting to the active namenode, SplitLogManager tried a total of 4 times to assign this task (the initial assignment plus 3 resubmits, as configured by hbase.splitlog.max.resubmit) and then finally gave up (a simplified sketch of this gating follows after this list).

      ---------- try 1 ---------------------
      2022-09-09 16:59:06,205 INFO  [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-1>,61020,1662540522069
      
      ---------- try 2 ---------------------
      
      2022-09-09 17:01:06,642 INFO  [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
      
      2022-09-09 17:01:06,666 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 2
      
      2022-09-09 17:01:06,715 INFO  [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-2>,61020,1662530684713
      
      ---------- try 3 ---------------------
      
      2022-09-09 17:03:07,643 INFO  [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
      
      2022-09-09 17:03:07,687 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 4
      
      2022-09-09 17:03:07,738 INFO  [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-3>,61020,1662542355806
      
      
      ---------- try 4 ---------------------
      2022-09-09 17:05:08,684 INFO  [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
      
      2022-09-09 17:05:08,717 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 6
      
      2022-09-09 17:05:08,757 INFO  [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-4>,61020,1662543537599
      
      
      ------- after all retries exhausted ---------
      
      2022-09-09 17:07:09,642 INFO  [ager__ChoreService_1] coordination.SplitLogManagerCoordination - Skipping resubmissions of task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta because threshold 3 reached
      

      6. After it exhausted all the retries, HMaster waited indefinitely for <RS-4> to complete the splitting. In this case the last region server (which had acquired the task) also died, so HMaster couldn't assign the hbase:meta table to any other region server.

      7. We failed over the active HMaster, and the new active HMaster was then able to successfully assign hbase:meta to another region server.
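
      For context, here is a simplified illustration of the resubmit gating seen in steps 5 and 6; this is not the actual HBase source, just a sketch of the observed behavior: each unforced resubmission is counted, and once hbase.splitlog.max.resubmit (default 3) is reached the coordination skips further resubmissions, leaving the master waiting on whichever region server holds the task last.

      import java.util.concurrent.atomic.AtomicInteger;

      /**
       * Simplified illustration (not the actual HBase source) of the resubmit
       * gating in the split-log coordination: a task may be resubmitted to
       * another worker at most hbase.splitlog.max.resubmit times, after which
       * resubmission is skipped.
       */
      class SplitTaskResubmitSketch {
        private final int resubmitThreshold;                        // hbase.splitlog.max.resubmit
        private final AtomicInteger unforcedResubmits = new AtomicInteger(0);

        SplitTaskResubmitSketch(int resubmitThreshold) {
          this.resubmitThreshold = resubmitThreshold;
        }

        /** Returns false once the threshold is reached, i.e. the master gives up resubmitting. */
        boolean maybeResubmit(String taskZnode) {
          if (unforcedResubmits.get() >= resubmitThreshold) {
            // Corresponds to "Skipping resubmissions of task ... because threshold 3 reached".
            System.out.println("Skipping resubmissions of task " + taskZnode
                + " because threshold " + resubmitThreshold + " reached");
            return false;
          }
          unforcedResubmits.incrementAndGet();
          // Corresponds to the "resubmitting task ..." lines in the log excerpts above.
          System.out.println("resubmitting task " + taskZnode);
          return true;
        }
      }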

      Looking at the logs of the last region server (<RS-4>):

      It acquired the split log task at 2022-09-09 17:05:08,732 and aborted at 2022-09-09 17:10:12,151.
      The SplitLogWorker thread was terminated as part of the shutdown process.

      2022-09-09 17:05:08,732 INFO  [-<RS-4>:61020] coordination.ZkSplitLogWorkerCoordination - worker <RS-4>,61020,1662543537599 acquired task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
      
      2022-09-09 17:10:12,151 FATAL [,queue=21,port=61020] regionserver.HRegionServer - ABORTING region server <RS-4>,61020,1662543537599: WAL sync failed, aborting to preserve WAL as source of truth
      
      2022-09-09 17:10:12,185 INFO  [-<RS-4>:61020] regionserver.SplitLogWorker - SplitLogWorker interrupted. Exiting. 
      

      Possible Improvements
      1. Can we remove the maximum-resubmit limit (hbase.splitlog.max.resubmit), either when the WAL belongs to the hbase:meta table or for all tables?
      2. If HMaster is not able to assign split WAL tasks after hbase.splitlog.max.resubmit retries, should it just abort itself? (A rough sketch of both ideas follows below.)

      Any other suggestions are welcome.
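
      To make the two ideas concrete, a rough sketch follows. The class, fields, and the abortOnExhaustion switch are all hypothetical; this is not a patch against the actual HBase code base.

      /**
       * Rough sketch of the two proposals; all names are hypothetical.
       */
      class ProposedResubmitPolicy {
        private final int resubmitThreshold;     // hbase.splitlog.max.resubmit
        private final boolean abortOnExhaustion; // hypothetical switch for proposal 2

        ProposedResubmitPolicy(int resubmitThreshold, boolean abortOnExhaustion) {
          this.resubmitThreshold = resubmitThreshold;
          this.abortOnExhaustion = abortOnExhaustion;
        }

        boolean shouldResubmit(String walPath, int resubmitsSoFar) {
          // Proposal 1: never cap resubmits for hbase:meta WALs, so the master keeps
          // handing the task to other live region servers instead of waiting forever.
          if (walPath.contains(".meta.")) {
            return true;
          }
          if (resubmitsSoFar < resubmitThreshold) {
            return true;
          }
          // Proposal 2: instead of silently waiting on the last worker, abort the
          // active master so a standby takes over (the manual failover that fixed
          // things in step 7, done automatically).
          if (abortOnExhaustion) {
            throw new IllegalStateException("Giving up on WAL split " + walPath
                + " after " + resubmitsSoFar + " resubmits; aborting master to force failover");
          }
          return false;
        }
      }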


            People

              Assignee: Unassigned
              Reporter: Rushabh Shah (shahrs87)
              Votes: 0
              Watchers: 7
