Details
Bug
Status: Resolved
Major
Resolution: Fixed
2.5.8
Reviewed
Description
A CloseRegionProcedure on the master asks the region server (RS) to close a region. After closing it, the RS reports a RegionStateTransition back to the master. On receiving the report, the master checks whether the regionNode has a procedure attached to it:
private boolean reportTransition(RegionStateNode regionNode, ServerStateNode serverNode,
    TransitionCode state, long seqId, long procId) throws IOException {
  ServerName serverName = serverNode.getServerName();
  TransitRegionStateProcedure proc = regionNode.getProcedure();
  if (proc == null) {
    return false;
  }
  proc.reportTransition(master.getMasterProcedureExecutor().getEnvironment(), regionNode,
    serverName, state, seqId, procId);
  return true;
}
If the regionNode has no procedure, the master only logs a warning and returns normally; no error is propagated back over the RPC.
Consider a master failover in which the new active master has just initialized the TransitRegionStateProcedure (TRSP) and its CloseRegionProcedure. The aborting master now holds stale region state. If the RS's transition report reaches the aborting master, that master does not reject it, so the procedure on the new active master never receives the report and gets stuck.
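The scenario can be reproduced in a small stand-alone model. The sketch below is illustrative only: `MasterModel`, `reportLenient`, `reportStrict`, and `deliver` are hypothetical names, not HBase classes or methods, and it assumes that the RS retries against another master when the report RPC fails with an exception. It contrasts the behaviour described above (an unmatched report returns normally) with one possible guard (an aborting master rejects the report).

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model: none of these types are real HBase classes.
class TransitionReportSketch {

  /** Stand-in for one master's in-memory view of regions and their procedures. */
  static final class MasterModel {
    final String name;
    private final Map<String, String> procedureByRegion = new HashMap<>();
    private final Set<String> wokenProcedures = new HashSet<>();
    private boolean aborting;

    MasterModel(String name) { this.name = name; }

    void attachProcedure(String region, String procedure) {
      procedureByRegion.put(region, procedure);
    }

    void markAborting() { aborting = true; }

    boolean procedureWoken(String procedure) { return wokenProcedures.contains(procedure); }

    /** Behaviour described in the ticket: an unmatched report returns normally. */
    boolean reportLenient(String region) {
      String procedure = procedureByRegion.get(region);
      if (procedure == null) {
        return false; // only logged; the caller sees a successful RPC
      }
      wokenProcedures.add(procedure);
      return true;
    }

    /** Assumed guard: an aborting master refuses the report so the caller can retry. */
    boolean reportStrict(String region) throws IOException {
      if (aborting) {
        throw new IOException("Master " + name + " is aborting; rejecting report for " + region);
      }
      return reportLenient(region);
    }
  }

  /**
   * Stand-in for the region server: tries each master in order and stops at the
   * first call that returns without an exception. Returns that master's name,
   * or null if every master rejected the report.
   */
  static String deliver(String region, boolean strict, MasterModel... masters) {
    for (MasterModel master : masters) {
      try {
        if (strict) {
          master.reportStrict(region);
        } else {
          master.reportLenient(region);
        }
        return master.name;
      } catch (IOException rejected) {
        // Assumed retry behaviour: move on to the next master.
      }
    }
    return null;
  }

  /** Builds the failover scenario; returns whether the new master's procedure was woken. */
  static boolean closeProcedureWoken(boolean strict) {
    String region = "888a715d5926adbb89c985d8967f40d4";
    MasterModel oldMaster = new MasterModel("server4-1"); // aborting, has no procedure
    oldMaster.markAborting();
    MasterModel newMaster = new MasterModel("server5-1"); // owns the close procedure
    newMaster.attachProcedure(region, "pid=16276416");
    deliver(region, strict, oldMaster, newMaster);
    return newMaster.procedureWoken("pid=16276416");
  }

  public static void main(String[] args) {
    System.out.println("lenient: procedure woken = " + closeProcedureWoken(false));
    System.out.println("strict:  procedure woken = " + closeProcedureWoken(true));
  }
}
```

In the lenient case the aborting master absorbs the report and the procedure on the new master is never woken; in the strict case the rejection lets the report reach the master that owns the procedure. Whether the actual fix rejects on abort, on a missing procedure, or on both is not stated here, so the guard condition should be read as one option.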
Logs illustrating the sequence
Active master server4-1 aborting:
2024-06-20 04:45:05,576 ERROR [iority.RWQ.Fifo.write.handler=3,queue=0,port=61000] master.HMaster - ***** ABORTING master server4-1,61000,1715413775736: Failed to record region server as started *****
Logs of the new active master server5-1:
2024-06-20 04:49:28,893 DEBUG [aster/server5-1:61000:becomeActiveMaster] assignment.RegionStateStore - Load hbase:meta entry region=888a715d5926adbb89c985d8967f40d4, regionState=OPEN, lastHost=server1-119,61020,1717560166420, regionLocation=server1-119,61020,1717560166420, openSeqNum=34892620
2024-06-20 04:49:51,886 INFO [PEWorker-22] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=16276416, ppid=16276108, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=RIMBS.UPLOADER_JOB_DETAILS, region=888a715d5926adbb89c985d8967f40d4, UNASSIGN}] (on server5-1)
2024-06-20 04:49:52,022 INFO [PEWorker-40] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=16276470, ppid=16276416, state=RUNNABLE; CloseRegionProcedure 888a715d5926adbb89c985d8967f40d4, server=server1-119,61020,1717560166420}] (on server5-1)
RS logs for the region close:
2024-06-20 04:49:52,267 INFO [_REGION-regionserver/server1-119:61020-2] handler.UnassignRegionHandler - Close 888a715d5926adbb89c985d8967f40d4
2024-06-20 04:49:52,267 DEBUG [_REGION-regionserver/server1-119:61020-2] regionserver.HRegion - Closing 888a715d5926adbb89c985d8967f40d4, disabling compactions & flushes
2024-06-20 04:49:52,354 INFO [_REGION-regionserver/server1-119:61020-2] regionserver.HRegion - Closed TABLE,KW\x00na240-app1-16\x00/Events-120620231740\x00MARKER-Events,1702619592612.888a715d5926adbb89c985d8967f40d4.
Log of the report arriving at the aborting HMaster (host server4-1, HBase master log file):
2024-06-20 04:49:52,355 WARN [iority.RWQ.Fifo.write.handler=1,queue=0,port=61000] assignment.AssignmentManager - No matching procedure found for server1-119,61020,1717560166420 transition on state=OPEN, location=server1-119,61020,1717560166420, table=RIMBS.UPLOADER_JOB_DETAILS, region=888a715d5926adbb89c985d8967f40d4 to CLOSED