[HBASE-20634] Reopen region while server crash can cause the procedure to be stuck - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 2.1.0, 2.0.1
Component/s: None
Labels: None
Hadoop Flags: Reviewed
Release Note:

A second attempt at fixing HBASE-20173. Completes the previously unfinished tracking of server state inside the AM (ONLINE=>SPLITTING=>OFFLINE=>null). Concurrent unassigns check the server state to decide whether they should wait for the SCP to wake them up.
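
The server state lifecycle named in the release note can be sketched as follows. This is a minimal illustration, not the AssignmentManager code from the patch: the class `ServerStateSketch`, its method names, and the rule that an unassign waits only while the server is SPLITTING or OFFLINE are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of per-server state tracking; not the HBase API. */
final class ServerStateSketch {
  enum ServerState { ONLINE, SPLITTING, OFFLINE }

  // A server with no entry corresponds to the final "null" state.
  private final ConcurrentMap<String, ServerState> states = new ConcurrentHashMap<>();

  void serverOnline(String server) {
    states.put(server, ServerState.ONLINE);
  }

  /** Advances one step: ONLINE => SPLITTING => OFFLINE => null (entry removed). */
  void advance(String server) {
    states.computeIfPresent(server, (name, state) -> {
      switch (state) {
        case ONLINE:
          return ServerState.SPLITTING;
        case SPLITTING:
          return ServerState.OFFLINE;
        default:
          return null; // OFFLINE: returning null removes the mapping
      }
    });
  }

  ServerState stateOf(String server) {
    return states.get(server);
  }

  /**
   * Assumed rule: an unassign waits to be woken only while crash
   * processing for the server is still in progress.
   */
  boolean shouldWaitForCrashProcessing(String server) {
    ServerState state = states.get(server);
    return state == ServerState.SPLITTING || state == ServerState.OFFLINE;
  }
}
```

Keeping the state in a `ConcurrentMap` and advancing it with `computeIfPresent` makes each transition atomic, so a concurrent reader never observes a half-applied step.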


Description

Found this while implementing HBASE-20424, where we transition the peer sync replication state while a server crash is being handled.

The problem is that ServerCrashAssign does not hold the region lock. After we call handleRIT to clear the existing assign/unassign procedures related to this rs, and before we schedule the assign procedures, an unassign procedure can be scheduled for a region on the crashed rs. This procedure will not receive the ServerCrashException. Instead, addToRemoteDispatcher finds that it cannot dispatch the remote call and raises a FailedRemoteDispatchException. We do not treat this exception the same as ServerCrashException; we try to expire the rs instead. The rs has already been marked as expired, so this is almost a no-op, and the procedure then stays stuck forever.

A possible fix is to treat FailedRemoteDispatchException the same as ServerCrashException: it is created only in addToRemoteDispatcher, and the only reason we cannot dispatch a remote call is that the rs is already dead. The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
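
The proposed handling can be sketched roughly as below. This is only an illustration of the idea, not the patch itself: the class `RemoteDispatchSketch`, the `Action` enum, and the method signatures are hypothetical, and the two exception classes are local stand-ins that only borrow the names used in this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of the proposed fix; not HBase's actual classes. */
final class RemoteDispatchSketch {
  static class ServerCrashException extends Exception {
    ServerCrashException(String message) { super(message); }
  }

  static class FailedRemoteDispatchException extends Exception {
    FailedRemoteDispatchException(String message) { super(message); }
  }

  enum Action { HANDLE_AS_SERVER_CRASH, TRY_EXPIRE_SERVER }

  // Stand-in for the dispatcher's nodeMap: one entry per server it can reach.
  private final ConcurrentMap<String, Boolean> nodeMap = new ConcurrentHashMap<>();

  void addServer(String server) { nodeMap.put(server, Boolean.TRUE); }

  void removeServer(String server) { nodeMap.remove(server); }

  /** Uses nodeMap as the guard: a missing node means the server is gone. */
  void addToRemoteDispatcher(String server) throws FailedRemoteDispatchException {
    if (!nodeMap.containsKey(server)) {
      throw new FailedRemoteDispatchException("no dispatcher node for " + server);
    }
    // A real dispatcher would queue the remote call here.
  }

  /** Treats a failed dispatch exactly like a server crash notification. */
  static Action onRemoteCallFailed(Exception e) {
    if (e instanceof ServerCrashException || e instanceof FailedRemoteDispatchException) {
      return Action.HANDLE_AS_SERVER_CRASH;
    }
    return Action.TRY_EXPIRE_SERVER;
  }
}
```

With this mapping, an unassign that loses the race against crash handling takes the same path as one that was notified of the crash directly, rather than trying to expire a server that is already expired.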

Attachments

HBASE-20634.branch-2.0.001.patch	31/May/18 04:49	41 kB	Michael Stack
HBASE-20634.branch-2.0.002.patch	31/May/18 16:06	41 kB	Michael Stack
HBASE-20634.branch-2.0.003.patch	31/May/18 16:06	41 kB	Michael Stack
HBASE-20634.branch-2.0.004.patch	31/May/18 17:47	41 kB	Michael Stack
HBASE-20634.branch-2.0.005.patch	31/May/18 21:19	43 kB	Michael Stack
HBASE-20634.branch-2.0.006.patch	01/Jun/18 17:30	43 kB	Michael Stack
HBASE-20634.branch-2.0.006.patch	01/Jun/18 05:39	43 kB	Michael Stack
HBASE-20634.branch-2.0.007.patch	01/Jun/18 23:15	45 kB	Michael Stack
HBASE-20634.branch-2.0.008.patch	04/Jun/18 04:05	46 kB	Michael Stack
HBASE-20634.branch-2.0.009.patch	04/Jun/18 04:47	46 kB	Michael Stack
HBASE-20634-UT.patch	27/May/18 12:42	7 kB	Duo Zhang

Issue Links

relates to

HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

Resolved

links to

Review Board (branch-2.0)

Sub-Tasks

1.	MoveProcedure can be subtask of ModifyTableProcedure/ReopenTableRegionsProcedure; ensure all kosher.	Closed	Unassigned
2.	Move meta region when server crash can cause the procedure to be stuck	Resolved	Duo Zhang
3.	[hack] Don't add known not-OPEN regions in reopen phase of MTP	Resolved	Josh Elser

People

Assignee: Michael Stack

Reporter: Duo Zhang

Votes: 0

Watchers: 10

Dates

Created: 24/May/18 09:08

Updated: 01/Aug/18 06:21

Resolved: 04/Jun/18 19:40