Details
- Improvement
- Status: Resolved
- Critical
- Resolution: Fixed
- None
- Reviewed
Description
One scenario we noticed in production:
a DisableTableProc and an SCP were triggered at almost the same time.
2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - Set <TABLE_NAME> to state=DISABLING
2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure <regionserver>, splitWal=true, meta=false
DisableTableProc creates unassign procs while the ASSIGNs of the SCP have not yet completed:
2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - LOCK_EVENT_WAIT pid=21594220, ppid=21592440, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN
The UNASSIGN created by DisableTableProc got stuck on the dead regionserver, and we had to manually bypass the unassign of DisableTableProc and then do the ASSIGN.
Could we break the retry loop of the UNASSIGN procedure when there is an SCP for that server, so that no manual intervention is needed? At least the DisableTableProc could then go to a rollback state.
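The proposal above can be illustrated with a minimal sketch. This is not HBase code and not the fix that was eventually applied: the class `UnassignRetryDecision`, the `Action` enum, and the server names are all hypothetical. It only shows the decision being asked for, namely that an UNASSIGN stops retrying against a server for which an SCP is already in flight.

```java
import java.util.Set;

// Hypothetical sketch, not HBase code: every name here is invented for illustration.
final class UnassignRetryDecision {

    // What an UNASSIGN attempt could do after a failed close request.
    enum Action { RETRY_CLOSE, YIELD_TO_SCP }

    // Stop retrying when the target server already has a crash procedure in flight;
    // otherwise keep retrying the close on that server.
    static Action decide(String targetServer, Set<String> serversWithActiveScp) {
        if (serversWithActiveScp.contains(targetServer)) {
            return Action.YIELD_TO_SCP;
        }
        return Action.RETRY_CLOSE;
    }

    public static void main(String[] args) {
        // Assumed server names; the ticket only shows <regionserver>.
        Set<String> crashed = Set.of("rs-dead");
        System.out.println(decide("rs-dead", crashed));
        System.out.println(decide("rs-alive", crashed));
    }
}
```

What "yielding" should mean in practice (waiting for the SCP, rolling back the DisableTableProc, or something else) is exactly the open question of this ticket, so the sketch leaves it as a label.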
Attachments
Issue Links
- is related to
  - HBASE-23636 Disable table may hang when regionserver stop or abort. (Resolved)
- relates to
  - HBASE-28582 ModifyTableProcedure should not reset TRSP on region node when closing unused region replicas (Resolved)
  - HBASE-28683 Only allow one TableProcedureInterface for a single table to run at the same time for some special procedure types (Resolved)
By design, the SCP interrupts the TRSP so that the region is assigned first and then unassigned.
Bypassing the unassign TRSP may cause data loss: disabling a table does not always mean we want to drop it, since it could be enabled later…
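The ordering described in this comment can be modelled as a tiny sketch. It is a hypothetical illustration only: `CrashedRegionPlan`, `Step`, and `stepsFor` are invented names and do not correspond to the actual TRSP implementation. The point is that an unassign interrupted by a server crash becomes "assign, then unassign" rather than a bare unassign.

```java
import java.util.List;

// Hypothetical model, not HBase code: names are invented for illustration.
final class CrashedRegionPlan {

    enum Step { ASSIGN, UNASSIGN }

    // Steps needed to bring a region to the closed state for a table being disabled.
    // If the hosting server crashed, the region is assigned first and only then
    // unassigned; otherwise a plain unassign is enough.
    static List<Step> stepsFor(boolean hostingServerCrashed) {
        return hostingServerCrashed
            ? List.of(Step.ASSIGN, Step.UNASSIGN)
            : List.of(Step.UNASSIGN);
    }

    public static void main(String[] args) {
        System.out.println(stepsFor(true));
        System.out.println(stepsFor(false));
    }
}
```

Under this model, bypassing the unassign skips the assign step as well, which is the situation the comment warns about.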