[HBASE-20152] [AMv2] DisableTableProcedure versus ServerCrashProcedure - ASF JIRA

Details

Type: Umbrella
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 2.2.0, 2.1.1, 2.0.3
Component/s: amv2
Labels:
None

Hadoop Flags:

Reviewed

Description

Seeing a small spate of issues where disabled tables/regions are being assigned. Usually they happen when a DisableTableProcedure is running concurrent with a ServerCrashProcedure. See below. See associated ~~HBASE-20131~~. This is umbrella issue for fixing.

Deadlock

From ~~HBASE-20137~~, 'TestRSGroups is Flakey', https://issues.apache.org/jira/browse/HBASE-20137?focusedCommentId=16390325&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16390325

 * SCP is running because a server was aborted in test.
 * SCP starts AssignProcedure of region X from crashed server.
 * DisableTable Procedure runs because test has finished and we're doing table delete. Queues 
 * UnassignProcedure for region X.
 * Disable Unassign gets Lock on region X first.
 * SCP AssignProcedure tries to get lock, waits on lock.
 * DisableTable Procedure UnassignProcedure RPC fails because server is down (Thats why the SCP).
 * Tries to expire the server it failed the RPC against. Fails (currently being SCP'd).
 * DisableTable Procedure Unassign is suspended. It is a suspend with lock on region X held
 * SCP can't run because lock on X is held
 * Test timesout.

Here is the actual log from around the deadlock. pid=308 is the SCP. pid=309 is the disable table:

2018-03-05 11:29:21,224 DEBUG [PEWorker-7] procedure.ServerCrashProcedure(225): Done splitting WALs pid=308, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=1cfd208ff882,40584,1520249102524, splitWal=true, meta=false
2018-03-05 11:29:21,300 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminServer(371): Move server done: default=>appInfo
2018-03-05 11:29:21,307 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] rsgroup.RSGroupAdminEndpoint$RSGroupAdminServiceImpl(279): Client=jenkins//172.17.0.2 list rsgroup
2018-03-05 11:29:21,312 INFO  [Time-limited test] client.HBaseAdmin$15(901): Started disable of Group_ns:testKillRS
2018-03-05 11:29:21,313 INFO  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.HMaster$7(2278): Client=jenkins//172.17.0.2 disable Group_ns:testKillRS
2018-03-05 11:29:21,384 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043}]
2018-03-05 11:29:21,534 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] procedure2.ProcedureExecutor(865): Stored pid=309, state=RUNNABLE:DISABLE_TABLE_PREPARE; DisableTableProcedure table=Group_ns:testKillRS
2018-03-05 11:29:21,542 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,644 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:21,847 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,118 DEBUG [PEWorker-5] hbase.MetaTableAccessor(1944): Put {"totalColumns":1,"row":"Group_ns:testKillRS","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1520249362117}]},"ts":1520249362117}
2018-03-05 11:29:22,123 INFO  [PEWorker-5] hbase.MetaTableAccessor(1646): Updated table Group_ns:testKillRS state to DISABLING in META
2018-03-05 11:29:22,148 DEBUG [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=38498] master.MasterRpcServices(1134): Checking to see if procedure is done pid=309
2018-03-05 11:29:22,345 INFO  [PEWorker-5] procedure2.ProcedureExecutor(1495): Initialized subprocedures=[{pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524}]
2018-03-05 11:29:22,503 INFO  [PEWorker-13] procedure.MasterProcedureScheduler(571): pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524, Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
2018-03-05 11:29:22,503 INFO  [PEWorker-5] procedure.MasterProcedureScheduler(571): pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043.
2018-03-05 11:29:22,504 INFO  [PEWorker-13] assignment.RegionStateStore(189): pid=311 updating hbase:meta row=Group_ns:testKillRS,,1520249357742.de7534c208a06502537cd95c248b3043., regionState=CLOSING, regionLocation=1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,504 DEBUG [PEWorker-5] assignment.RegionTransitionProcedure(377): LOCK_EVENT_WAIT pid=310 serverLocks={}, namespaceLocks={}, tableLocks={}, regionLocks={{de7534c208a06502537cd95c248b3043=exclusiveLockOwner=311, sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
2018-03-05 11:29:22,504 DEBUG [PEWorker-5] procedure2.ProcedureExecutor(1230): LOCK_EVENT_WAIT pid=310, ppid=308, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043
2018-03-05 11:29:22,507 INFO  [PEWorker-13] assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.RegionTransitionProcedure(187): Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524

It disable table procedure unassign and the SCP assign arrive at same time, in same ms (2018-03-05 11:29:22,503) but the unassign manages to get the lock first.

Delete of online Regions

Saw this in nightly failure #452 for branch-2 in TestSplitTransactionOnCluster.org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster

 * DisableTableProcedure is queued before SCP.
 * DisableTableProcedure Unassign fails because can't RPC to crashed server and can't expire.
 * Unassign is Stuck in suspend.
 * SCP runs and cleans up suspended Disable Unassign.
 * SCP completes which includes assign of Disable Unassign region.
 * Disable Unassign completes
 * Disable completes.
 * A scheduled Drop Table Procedure runs (its end of test).
 * Succeeds deleting regions that are actually assigned (see above where SCP assigned region).

Attachments

Issue Links

is related to

HBASE-20137 TestRSGroups is flakey

Resolved

Sub-Tasks

1.	NPE in MoveRegionProcedure via IntegrationTestLoadAndVerify with CM	Resolved	Josh Elser
2.	[AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock	Resolved	Michael Stack
3.	[AMv2] Throw exception if hostile environment	Resolved	Michael Stack
4.	[AMv2] Don't move region if its a split parent or offlined	Resolved	Michael Stack
5.	Make MoveRegionProcedure hold region lock for life of the procedure	Resolved	Michael Stack

[AMv2] DisableTableProcedure versus ServerCrashProcedure