Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.96.1, 0.98.1, 0.99.0
None
Hadoop Flags: Reviewed
Description
This issue is similar to HBASE-10833, which deals with the sendRegionOpen RPC, whereas this issue occurs with sendRegionClose.
Once a RegionServer (RS) is in the failed servers list due to a network hiccup, the AssignmentManager (AM) quickly exhausts all retries and then fails the whole region assignment. Below is a sample stack trace:
2014-03-31 13:39:10,056 INFO [AM.-pool1-t8] master.AssignmentManager: Server hor16n09.gq1.ygridcore.net,60020,1396270942046 returned org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: hor16n09.gq1.ygridcore.net/68.142.246.220:60020 for loadtest_d1,59999994,1396261861562.fcef8d691632e99948fbf876d24f907e., try=20 of 20
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: hor16n09.gq1.ygridcore.net/68.142.246.220:60020
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:880)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1065)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.tracedWriteRequest(RpcClient.java:1032)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1474)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
    at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.closeRegion(AdminProtos.java:20854)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.closeRegion(ProtobufUtil.java:1656)
    at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:693)
    at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1685)
    at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1786)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1436)
    at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:45)
    ....
2014-03-31 13:39:10,056 WARN [AM.-pool1-t8] master.RegionStates: Failed to open/close fcef8d691632e99948fbf876d24f907e on hor16n09.gq1.ygridcore.net,60020,1396270942046, set to FAILED_CLOSE
2014-03-31 13:39:10,056 INFO [AM.-pool1-t8] master.RegionStates: Transitioned {fcef8d691632e99948fbf876d24f907e state=PENDING_OPEN, ts=1396273149814, server=hor16n09.gq1.ygridcore.net,60020,1396270942046} to {fcef8d691632e99948fbf876d24f907e state=FAILED_CLOSE, ts=1396273150056, server=hor16n09.gq1.ygridcore.net,60020,1396270942046}
2014-03-31 13:39:10,056 INFO [AM.-pool1-t8] master.AssignmentManager: Skip assigning {ENCODED => fcef8d691632e99948fbf876d24f907e, NAME => 'loadtest_d1,59999994,1396261861562.fcef8d691632e99948fbf876d24f907e.', STARTKEY => '59999994', ENDKEY => '66666660'}, we couldn't close it: {fcef8d691632e99948fbf876d24f907e state=FAILED_CLOSE, ts=1396273150056, server=hor16n09.gq1.ygridcore.net,60020,1396270942046}
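
The two state timestamps in the log above (1396273149814 and 1396273150056) are only 242 ms apart, which suggests that each FailedServerException comes back almost immediately, so 20 attempts can be used up while the server is still on the failed servers list. The following is a minimal, self-contained sketch of that effect. It is not HBase code: the class FailedServerRetrySketch, its FailedServers helper, the 2000 ms expiry, the 1 ms per-attempt cost, and the 500 ms pause are all assumptions chosen for illustration. It compares retrying without a pause against pausing after each fast failure, which is one possible mitigation and not necessarily what the committed patch does.

```java
import java.util.HashMap;
import java.util.Map;

public class FailedServerRetrySketch {

    /** Hypothetical stand-in for an RPC client's failed-servers list. */
    static class FailedServers {
        private final Map<String, Long> expiryByServer = new HashMap<>();

        /** Marks a server as failed until nowMs + expiryMs. */
        void add(String server, long nowMs, long expiryMs) {
            expiryByServer.put(server, nowMs + expiryMs);
        }

        /** True while the server's entry has not yet expired. */
        boolean isFailed(String server, long nowMs) {
            Long expiry = expiryByServer.get(server);
            return expiry != null && nowMs < expiry;
        }
    }

    /**
     * Simulates a close-region retry loop on a fake clock that starts at 0.
     * Returns the 1-based attempt that gets past the failed-servers check,
     * or -1 if every attempt is rejected.
     */
    static int closeWithRetries(FailedServers failed, String server, int maxAttempts,
                                long costPerAttemptMs, long sleepOnFailedServerMs) {
        long nowMs = 0; // simulated clock, no real sleeping
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            nowMs += costPerAttemptMs; // a rejected call fails fast
            if (!failed.isFailed(server, nowMs)) {
                return attempt; // the RPC would actually be sent now
            }
            nowMs += sleepOnFailedServerMs; // optional pause before retrying
        }
        return -1;
    }

    public static void main(String[] args) {
        String server = "rs1.example.org:60020"; // made-up server name
        FailedServers failed = new FailedServers();
        failed.add(server, 0, 2000); // assumed 2000 ms expiry

        // No pause: all 20 attempts finish within 20 ms and are rejected.
        System.out.println(closeWithRetries(failed, server, 20, 1, 0));   // prints -1
        // 500 ms pause after each rejection: the entry expires first.
        System.out.println(closeWithRetries(failed, server, 20, 1, 500)); // prints 5
    }
}
```

With no pause, the retry budget is spent long before the assumed expiry elapses, which matches the "try=20 of 20" line in the log. With a pause, the fifth attempt lands after the entry has expired.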