Description
We hit a case where the meta table shows a region as OPEN on a server that is already dead (it is hard to trace how this happened), meaning the region is actually not online. The balancer then scheduled a MoveRegionProcedure for this region, which created a mess:
The balancer 'thought' this region was on the live server that has the same address as the dead one (but a different startcode). So it scheduled an MRP from this online server to another. However, the UnassignProcedure dispatched the unassign call to the dead server, according to the region state, then found that server dead and scheduled an SCP for it. But since the UnassignProcedure's hostingServer is not accurate (it points to the live server), the SCP can't interrupt it.
So, in the end, the SCP can't finish because the UnassignProcedure holds the region's lock, and the UnassignProcedure can't finish because no one wakes it up. Both are stuck.
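To illustrate the mismatch, here is a minimal sketch of how comparing servers by address alone conflates a dead instance with a restarted one. It assumes server names are plain "host,port,startcode" strings; the class `ServerNameSketch` and its methods are hypothetical and are not HBase's actual API.

```java
// Hypothetical helper, not HBase code: server names are modelled as
// "host,port,startcode" strings.
final class ServerNameSketch {
    // Drops the trailing ",startcode", leaving "host,port".
    static String addressOf(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    // True when host and port match, even if the startcodes differ.
    static boolean sameAddress(String a, String b) {
        return addressOf(a).equals(addressOf(b));
    }

    // True only when host, port and startcode all match.
    static boolean sameInstance(String a, String b) {
        return a.equals(b);
    }
}
```

With a placeholder host and the two startcodes from this issue, `sameAddress("host-003,16020,1539153278584", "host-003,16020,1539076734964")` is true while `sameInstance` on the same pair is false, which is the confusion described above.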
Here is the log. Note that the server of the UnassignProcedure is 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584', but the call was dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964':
2018-10-10 14:34:50,011 INFO [PEWorker-4] assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
2018-10-10 14:34:50,011 WARN [PEWorker-4] assignment.RegionTransitionProcedure(230): Remote call failed hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
//Then a SCP was scheduled
2018-10-10 14:34:50,012 WARN [PEWorker-4] master.ServerManager(635): Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but server not online
2018-10-10 14:34:50,012 INFO [PEWorker-4] master.ServerManager(615): Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
//The SCP did not interrupt the UnassignProcedure but schedule new AssignProcedure for this region
2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
2018-10-10 14:34:50,054 INFO [PEWorker-8] procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
I also added a safety fence in the balancer: if any such region is found, balancing is skipped to be safe. It should do no harm.
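The idea of the fence can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: `BalancerFenceSketch`, its method names, and the use of plain "host,port,startcode" strings for server names are all hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not HBase code: region locations and online servers
// are modelled as full "host,port,startcode" strings.
final class BalancerFenceSketch {
    // True if some region is recorded on a server instance that is not online.
    // The full name is compared, so a restarted server with the same
    // host and port but a new startcode does not count as a match.
    static boolean hasRegionOnOfflineServer(Map<String, String> regionToServer,
                                            Set<String> onlineServers) {
        for (String server : regionToServer.values()) {
            if (!onlineServers.contains(server)) {
                return true;
            }
        }
        return false;
    }

    // The fence: skip the whole balancing round if any such region exists.
    static boolean shouldBalance(Map<String, String> regionToServer,
                                 Set<String> onlineServers) {
        return !hasRegionOnOfflineServer(regionToServer, onlineServers);
    }
}
```

Skipping the whole round rather than only the suspect regions is the conservative choice here: it costs at most one delayed balancing run, while a move scheduled against a stale location can get stuck as described above.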