Details
- Bug
- Status: Closed
- Critical
- Resolution: Incomplete
- 0.96.0
- None
- None
Description
In summary, a server dies and its regions are reassigned by SSH (ServerShutdownHandler). Right before SSH runs, however, the balancer starts assigning one of that server's regions to another server.
The balancer assignment got preempted by the SSH assignment:
2013-09-25 11:55:32,854 INFO Priority.RpcServer.handler=7,port=60020 regionserver.HRegionServer: Received CLOSE for the region:6deb1bfefe8cbdb443084efe919fdeb7 , which we are already trying to OPEN. Cancelling OPENING.
The SSH assignment (by GeneralBulkAssigner) failed too, due to:
2013-09-25 11:55:32,927 WARN [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
In the end, the region 6deb1bfefe8cbdb443084efe919fdeb7 is lost.
Below is the master log. It shows both the balancer and SSH trying to assign the region at around the same time:
2013-09-25 11:55:32,731 INFO [MASTER_SERVER_OPERATIONS-hor15n05:60000-4] master.RegionStates: Transitioning {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_CLOSE, ts=1380110132710, server=hor15n12.gq1.ygridcore.net,60020,1380109596307} will be handled by SSH for hor15n12.gq1.ygridcore.net,60020,1380109596307
...
2013-09-25 11:55:32,849 INFO [hor15n05.gq1.ygridcore.net,60000,1380108611483-BalancerChore] master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132768, server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132849, server=hor15n07.gq1.ygridcore.net,60020,1380109890414}
...
2013-09-25 11:55:32,898 INFO [hor15n05.gq1.ygridcore.net,60000,1380108611483-GeneralBulkAssigner-1] master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132861, server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132898, server=hor15n09.gq1.ygridcore.net,60020,1380109280320}
SSH forces the region assignment but does not recreate the offline znode, so the existing znode still names the balancer's destination (hor15n07) and the later region open on hor15n09 fails with the following error. I suggest recreating the offline znode whenever we force a region assignment (forceNewPlan=true); this should have low impact.
2013-09-25 11:55:32,927 WARN [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
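The race and the suggested fix can be illustrated with a toy model. This is only a sketch under stated assumptions: `UnassignedNodeModel`, `assign`, `transitionToOpening`, and `simulate` are hypothetical names invented for illustration, not HBase's ZKAssign or AssignmentManager code, and the shortened server names stand in for the hosts in the logs above. The model assumes what this report describes: the offline znode records the server expected to open the region, and a forced assignment leaves an existing node untouched unless it is explicitly recreated.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, NOT HBase's ZKAssign or AssignmentManager.
class UnassignedNodeModel {
    enum State { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    // One "znode" per region: its state plus the server the master expects to open it.
    static final class Node {
        State state;
        final String expectedServer;

        Node(State state, String expectedServer) {
            this.state = state;
            this.expectedServer = expectedServer;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    // Master side: plan an assignment. An existing node is kept as-is
    // unless recreateOfflineNode is true (the fix suggested in this report).
    void assign(String region, String server, boolean recreateOfflineNode) {
        if (!nodes.containsKey(region) || recreateOfflineNode) {
            nodes.put(region, new Node(State.M_ZK_REGION_OFFLINE, server));
        }
    }

    // Region server side: OFFLINE -> OPENING succeeds only for the expected server.
    boolean transitionToOpening(String region, String server) {
        Node node = nodes.get(region);
        if (node == null
                || node.state != State.M_ZK_REGION_OFFLINE
                || !node.expectedServer.equals(server)) {
            return false;
        }
        node.state = State.RS_ZK_REGION_OPENING;
        return true;
    }

    // Replays the sequence from the logs; returns whether the SSH destination can open the region.
    static boolean simulate(boolean recreateOnForcedPlan) {
        UnassignedNodeModel zk = new UnassignedNodeModel();
        String region = "6deb1bfefe8cbdb443084efe919fdeb7";

        // 1. Balancer plans the region onto hor15n07 and creates the offline node.
        zk.assign(region, "hor15n07", false);
        // 2. hor15n07 receives CLOSE and cancels its OPENING, so the node is never transitioned.
        // 3. SSH (GeneralBulkAssigner) forces a new plan onto hor15n09.
        zk.assign(region, "hor15n09", recreateOnForcedPlan);
        // 4. hor15n09 tries to move the node from OFFLINE to OPENING.
        return zk.transitionToOpening(region, "hor15n09");
    }

    public static void main(String[] args) {
        System.out.println("without recreating offline node: " + simulate(false)); // false
        System.out.println("with recreated offline node:     " + simulate(true));  // true
    }
}
```

In this model the first run fails for the same reason as the WARN line above: the node still expects hor15n07 when hor15n09 attempts the transition. Recreating the node on a forced plan is the only change between the two runs.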
Issue Links
- is related to HBASE-9514 Prevent region from assigning before log splitting is done (Closed)