[HBASE-21288] HostingServer in UnassignProcedure is not accurate - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0, 2.0.2
Fix Version/s: 2.1.1, 2.0.3
Component/s: amv2, Balancer
Labels:
- balancer

Description

We hit a case where a region is recorded as OPEN in the meta table on a server that is already dead (it is hard to trace how this happened), meaning the region is not actually online. The balancer then ran and scheduled a MoveRegionProcedure (MRP) for this region, which created a mess:
The balancer 'thought' the region was on the live server that has the same address as the dead one (but a different startcode), so it scheduled an MRP from this online server to another. The UnassignProcedure, however, dispatched the unassign call to the dead server according to the region state, found the server dead, and scheduled a ServerCrashProcedure (SCP) for it. Because the UnassignProcedure's hostingServer is not accurate, the SCP cannot interrupt it.
So, in the end, the SCP cannot finish because the UnassignProcedure holds the region's lock, and the UnassignProcedure cannot finish because nothing wakes it up, so both are stuck.
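
The mismatch described above can be illustrated with a minimal, self-contained sketch. It does not use HBase classes: `Server`, `sameAddress`, and `crashHandlerWouldInterrupt` are hypothetical names invented for illustration, and the host name is a placeholder. It only assumes that a server identity is compared by host, port, and startcode together, so two incarnations of the same address are different servers.

```java
import java.util.Objects;

public class StaleHostingServerSketch {
    /** Hypothetical stand-in for a server identity: host, port, and startcode. */
    static final class Server {
        final String host;
        final int port;
        final long startcode;

        Server(String host, int port, long startcode) {
            this.host = host;
            this.port = port;
            this.startcode = startcode;
        }

        /** True when both servers share host and port, regardless of startcode. */
        boolean sameAddress(Server other) {
            return host.equals(other.host) && port == other.port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Server)) {
                return false;
            }
            Server s = (Server) o;
            return host.equals(s.host) && port == s.port && startcode == s.startcode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startcode);
        }
    }

    /**
     * Hypothetical check: a crash handler only interrupts a procedure whose
     * recorded hosting server is exactly the crashed server.
     */
    static boolean crashHandlerWouldInterrupt(Server crashed, Server hostingServer) {
        return crashed.equals(hostingServer);
    }

    public static void main(String[] args) {
        // Startcodes taken from the log in this issue; the host name is a placeholder.
        Server dead = new Server("host-003", 16020, 1539076734964L);
        Server live = new Server("host-003", 16020, 1539153278584L);

        System.out.println(dead.sameAddress(live));                  // true
        System.out.println(crashHandlerWouldInterrupt(dead, live));  // false: procedure is never woken
        System.out.println(crashHandlerWouldInterrupt(dead, dead));  // true
    }
}
```

In this sketch the procedure records the live incarnation as its hosting server while the call goes to the dead one, so the crash handling for the dead server never matches the procedure.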

Here is the log. Notice that the server of the UnassignProcedure is 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584', but it was dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'.

2018-10-10 14:34:50,011 INFO  [PEWorker-4] assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
2018-10-10 14:34:50,011 WARN  [PEWorker-4] assignment.RegionTransitionProcedure(230): Remote call failed hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584

// Then an SCP was scheduled
2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but server not online
2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false

// The SCP did not interrupt the UnassignProcedure but scheduled a new AssignProcedure for this region
2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
2018-10-10 14:34:50,054 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]

I also added a safety fence in the balancer: if such regions are found, balancing is skipped to be safe. It should do no harm.
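
As a rough illustration of such a fence (not the code in the attached patch), the sketch below skips balancing whenever any region's recorded location is not in the set of online servers. All names here (`BalancerFenceSketch`, `regionsOnUnknownServers`, `shouldBalance`) and the string encoding of server names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BalancerFenceSketch {
    /** Returns the regions whose recorded server is not among the online servers. */
    static List<String> regionsOnUnknownServers(Map<String, String> regionToServer,
                                                Set<String> onlineServers) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, String> entry : regionToServer.entrySet()) {
            if (!onlineServers.contains(entry.getValue())) {
                suspicious.add(entry.getKey());
            }
        }
        return suspicious;
    }

    /** Hypothetical fence: balance only when every region sits on an online server. */
    static boolean shouldBalance(Map<String, String> regionToServer,
                                 Set<String> onlineServers) {
        return regionsOnUnknownServers(regionToServer, onlineServers).isEmpty();
    }

    public static void main(String[] args) {
        // Server names are encoded as "host,port,startcode"; hosts are placeholders.
        Set<String> online = new HashSet<>();
        online.add("host-003,16020,1539153278584");
        online.add("host-002,16020,1539153270000");

        Map<String, String> regions = new LinkedHashMap<>();
        // Region recorded as OPEN on the dead incarnation of host-003.
        regions.put("267335c85766c62479fb4a5f18a1e95f", "host-003,16020,1539076734964");

        System.out.println(regionsOnUnknownServers(regions, online)); // [267335c85766c62479fb4a5f18a1e95f]
        System.out.println(shouldBalance(regions, online));           // false
    }
}
```

Comparing full server names, startcode included, is what keeps the dead incarnation from being mistaken for the live server at the same address.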

Attachments

- HBASE-21288.branch-2.0.001.patch (11/Oct/18 07:50, 3 kB, Allan Yang)
- HBASE-21288.branch-2.0.002.patch (11/Oct/18 14:06, 3 kB, Allan Yang)

People

Assignee: Allan Yang
Reporter: Allan Yang
Votes: 0
Watchers: 5

Dates

Created: 11/Oct/18 07:47
Updated: 22/Apr/19 23:54
Resolved: 18/Oct/18 16:13
