Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-21614

RIT recovery with ServerCrashProcedure doesn't account for all regions

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0, 2.2.0
    • Component/s: amv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Master is restarting after a previous master crashed while recovering some regions from a dead server.

      Master recovers RIT for the region, however the RIT has no location (logged, at least) in CONFIRM_CLOSE state. That is a potential problem #0.5 - confirm where? But that should be covered by meta, so not a big deal, right. As such it doesn't seem to add the region to server map anywhere

      2018-12-17 14:51:14,606 INFO  [master/:17000:becomeActiveMaster] assignment.AssignmentManager: Attach pid=38015, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE to rit=OFFLINE, location=null, table=t1, region=region1 to restore RIT
      

      However, in this case ServerCrashProcedure for the server kicks off BEFORE meta is loaded.
      That seems to be a problem #1 - it immediately gets regions to later recover, so in this case it gets nothing.
      I've grepped our logs for successful cases of SCP interacting with region transition at startup, and in all cases the meta was loaded before SCP.
      Seems like a race condition. Looks like SCP handles it

      2018-12-17 14:51:14,625 INFO  [master/:17000:becomeActiveMaster] master.RegionServerTracker: Starting RegionServerTracker; 0 have existing ServerCrashProcedures, 103 possibly 'live' servers, and 1 'splitting'.
      2018-12-17 14:51:20,770 INFO  [master/:17000:becomeActiveMaster] master.ServerManager: Processing expiration of server1,17020,1544636616174 on ,17000,1545087053243
      2018-12-17 14:51:20,921 INFO  [master/:17000:becomeActiveMaster] assignment.AssignmentManager: Added server1,17020,1544636616174 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=111298
      2018-12-17 14:51:30,728 INFO  [PEWorker-13] procedure.ServerCrashProcedure: Start pid=111298, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=server1,17020,1544636616174, splitWal=true, meta=false
      

      Meta is only loaded 11-12 seconds later.
      If one looks at meta-loading code however, there is one more problem #2 - the region is in CLOSING state, so the addRegionToServer is not going to be called - it's only called for OPENED regions.
      Expanding on the above, I've only seen SCP unblock stuck region transition at startup when region started out in meta as OPEN.

      2018-12-17 14:51:42,403 INFO  [master/:17000:becomeActiveMaster] assignment.RegionStateStore: Load hbase:meta entry region=region1, regionState=CLOSING, lastHost=server1,17020,1544636616174, regionLocation=server1,17020,1544636616174, openSeqNum=629131
      

      SCP predictably finishes without doing anything; no other logs for this pid

      2018-12-17 14:52:19,046 INFO  [PEWorker-2] procedure2.ProcedureExecutor: Finished pid=111298, state=SUCCESS, hasLock=false; ServerCrashProcedure server=server1,17020,1544636616174, splitWal=true, meta=false in 58.0010sec
      

      After that, region is still stuck trying to be closed in TransitRegionStateProcedure; it's in the same state for hours including across master restarts.

      2018-12-17 15:09:35,216 WARN  [PEWorker-14] assignment.TransitRegionStateProcedure: Failed transition, suspend 604secs pid=38015, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, REOPEN/MOVE; rit=CLOSING, location=server1,17020,1544636616174; waiting on rectified condition fixed by other Procedure or operator intervention
      

        Attachments

        1. HBASE-21614-v1.patch
          5 kB
          Duo Zhang
        2. HBASE-21614.patch
          5 kB
          Michael Stack
        3. HBASE-21614.patch
          5 kB
          Duo Zhang
        4. HBASE-21614.01.patch
          3 kB
          Sergey Shelukhin

          Activity

            People

            • Assignee:
              sershe Sergey Shelukhin
              Reporter:
              sershe Sergey Shelukhin
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: