Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-24438

Stale ServerCrashProcedure task in HBase Master UI

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.2.4
    • None
    • master
    • None
    • HBase 2.2.4
      HDFS 3.1.3 with erasure coding enabled
      Kubernetes

    Description

      Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with ServerCrashProcedure after master failover. The procedures have finished with SUCCESS on a previously active HBase master and aren't showing in "Procedures & Locks".

      Based on the logs, both of those regionserver were carrying hbase:meta (logs are sorted newest first grepped for those specific servers that have stale ServerCrashProcedures):

      2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true in 2.5290sec
       2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: removed crashed server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after splitting done
       2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: dead splitlog workers [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
       2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: Log dir for server regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
       2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
       2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: pid=38, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true found RIT pid=20, ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=ABNORMALLY_CLOSED, location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, table=hbase:meta, region=1588230740
       2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting] in 0ms
       2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting dir is empty, no logs to split.
       2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: dead splitlog workers [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
       2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] procedure.ServerCrashProcedure: Start pid=38, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, splitWal=true, meta=true
       2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] assignment.AssignmentManager: Scheduled SCP pid=38 for regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 (carryingMeta=true) regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write locks = 1, Read locks = 0], oldState=ONLINE.
       2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: Processing expiration of regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
       2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
      ...
      2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS; ServerCrashProcedure server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, splitWal=true, meta=true in 997msec
       2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27] procedure.ServerCrashProcedure: removed crashed server regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after splitting done
       2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager: dead splitlog workers [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
       2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager: Log dir for server regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does not exist
       2020-05-21 19:04:04,175 INFO [KeepAlivePEWorker-27] procedure.ServerCrashProcedure: regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 had 1 regions
       2020-05-21 19:04:03,958 INFO [KeepAlivePEWorker-27] procedure.ServerCrashProcedure: pid=37, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, splitWal=true, meta=true found RIT pid=20, ppid=18, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=OPENING, location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, table=hbase:meta, region=1588230740
       2020-05-21 19:04:03,852 INFO [KeepAlivePEWorker-27] master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting] in 0ms
       2020-05-21 19:04:03,845 INFO [KeepAlivePEWorker-27] master.SplitLogManager: hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting dir is empty, no logs to split.
       2020-05-21 19:04:03,843 INFO [KeepAlivePEWorker-27] master.SplitLogManager: dead splitlog workers [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
       2020-05-21 19:04:03,725 INFO [KeepAlivePEWorker-27] procedure.ServerCrashProcedure: Start pid=37, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, splitWal=true, meta=true
       2020-05-21 19:04:03,512 INFO [RegionServerTracker-0] assignment.AssignmentManager: Scheduled SCP pid=37 for regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 (carryingMeta=true) regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5bcc7453[Write locks = 1, Read locks = 0], oldState=ONLINE.
       2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.ServerManager: Processing expiration of regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 on hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
       2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
      

       

      The currently active HBase Master (that has the stale tasks) logged only about one regionserver that is shown in stale tasks (logs sorted oldest first):

       

      2020-05-21 19:04:26,467 INFO [master/hbasemaster-1:16000:becomeActiveMaster] assignment.AssignmentManager: Attach pid=20, ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN to rit=OPENING, location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, table=hbase:meta, region=1588230740 to restore RIT
      2020-05-21 19:04:26,758 INFO [master/hbasemaster-1:16000:becomeActiveMaster] master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1590087849283, server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010}
      2020-05-21 19:04:26,784 INFO [PEWorker-1] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
      2020-05-21 19:04:27,022 WARN [PEWorker-1] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=39, ppid=20, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} to server regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this.
      org.apache.hadoop.hbase.procedure2.NoServerDispatchException: regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010; pid=39, ppid=20, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
      2020-05-21 19:04:27,048 INFO [PEWorker-2] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=20, ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=OPENING, location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
      

       
      It hasn't logged anything about procedures 37 and 38 (the successfully completed ServerCrashProcedures).

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              timoha Andrey Elenskiy
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: