Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-23593

Stalled SCP Assigns

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.3
    • Fix Version/s: None
    • Component/s: proc-v2
    • Labels:
      None

      Description

      I'm stuck on this one so doing a write up here in case anyone else has ideas.

      Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem but from time to time I'll see the SCP stuck waiting on an Assign to finish. The assign seems stuck at the queuing of the OpenRegionProcedure. We've stored the procedure but then not a peek thereafter. Later we'll see complaint that the region is STUCK. Doesn't recover. Doesn't run.

      Basic story is as follows:

      Server dies:

       2019-12-17 11:10:42,002 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [s011.example.org,16020,1576561318119]
       2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added s011.example.org,16020,1576561318119; numProcessing=1
      ...
       2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: Started processing s011.example.org,16020,1576561318119; numProcessing=1
      

      The dead server restarts which purges the old server from dead server and processing lists:

       2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
       2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it from the dead servers list
      

      ....even though we are still processing logs in the SCP of the old server...

       2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: Archived processed log hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 to hdfs://nameservice1/hbase/oldWALs/s011.example.                            org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
      

      I thought early purge of deadserver was a problem but I don't think so after study.

      WALS split took two minutes to split and server was removed from dead servers... three minutes earlier...

       2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log files in [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] in 143236ms
      

      Almost immediately we get this:

       2019-12-17 11:14:08,649 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119, table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
      

      For this region assign, I see the SCP proc making an assign for this region which then makes a subtask to OpenRegionProcedure. This is where it gets stuck. No progress after this. The procedure does not come alive to run.

      Here are logs for the ORP pid=421761:

      2019-12-17 11:38:34,761 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
      
      2019-12-17 11:38:34,765 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
      2019-12-17 11:38:34,770 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=421761, ppid=402475, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th rollback step
      

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              stack Michael Stack
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated: