Description
The direct cause is that we are stuck in ServerManager.isServerReachable.
2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
The interval between the first and last retry log entries is about 1 minute, and the test only waits 1 minute, so the test times out.
I still do not know why this happens.
At the end of the log there are many entries like this:
2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
    at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
    at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
    at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
    at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
I think the problem is here
ServerManager.java
while (retryCounter.shouldRetry()) {
  ...
  try {
    retryCounter.sleepUntilNextRetry();
  } catch(InterruptedException ie) {
    Thread.currentThread().interrupt();
  }
  ...
}
We need to break out of the while loop when we get an InterruptedException, not just mark the current thread as interrupted.
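As a rough illustration of the proposed change (not the actual patch), here is a minimal, self-contained sketch of a retry loop that stops retrying once it is interrupted. The class RetryLoopSketch, the Probe interface, and the isReachable signature are all hypothetical stand-ins; the real code uses RetryCounter and an RPC call to the region server, which are replaced here by a plain counter and Thread.sleep.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryLoopSketch {

    /** Hypothetical stand-in for the RPC probe made inside isServerReachable. */
    public interface Probe {
        boolean ping() throws Exception;
    }

    /**
     * Retries the probe up to maxAttempts times, sleeping between attempts.
     * Stops retrying as soon as the sleep is interrupted.
     */
    public static boolean isReachable(Probe probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (probe.ping()) {
                    return true;
                }
            } catch (Exception e) {
                // The real code logs "Couldn't reach ..., try=N of M" here.
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the interrupt flag for callers, then leave the loop
                // instead of going around for the remaining attempts.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulate an interrupt arriving before the first sleep (e.g. during shutdown).
        Thread.currentThread().interrupt();
        boolean reachable = isReachable(() -> {
            calls.incrementAndGet();
            return false;
        }, 10, 1000);
        System.out.println(reachable + " after " + calls.get() + " attempt(s)");
    }
}
```

In this sketch the interrupted caller gives up after a single probe rather than running through all ten attempts, and the interrupt flag stays set so code further up the stack can still see it.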