Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Duplicate
-
2.0.0
-
None
-
None
Description
In our 0.94 HBase cluster, we found that a client held a wrong region location in its cache and did not update it after the region was moved to another regionserver.
The cause is a wrong client config combined with a bug in RpcRetryingCaller in the HBase client.
The RPC configs are as follows:
hbase.rpc.timeout=1000
hbase.client.pause=200
hbase.client.operation.timeout=1200
The client retry number, however, is 3:
hbase.client.retries.number=3
Assume a region is on regionserver A and is then moved to regionserver B. The client tries to make a call to regionserver A and gets a NotServingRegionException. Because the retry number is not 1, the cached region location is not cleared. See RpcRetryingCaller.java#141 and RegionServerCallable.java#127:
@Override
public void throwable(Throwable t, boolean retrying) {
  if (t instanceof SocketTimeoutException || ....
  } else if (t instanceof NotServingRegionException && !retrying) {
    // Purge cache entries for this specific region from hbase:meta cache
    // since we don't call connect(true) when number of retries is 1.
    getConnection().deleteCachedRegionLocation(location);
  }
}
However, the call is never retried. Instead, a SocketTimeoutException is thrown because the projected duration of the call is larger than the operation timeout. See RpcRetryingCaller.java#152:
expectedSleep = callable.sleep(pause, tries + 1);
// If, after the planned sleep, there won't be enough time left, we stop now.
long duration = singleCallDuration(expectedSleep);
if (duration > callTimeout) {
  String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
      ": " + callable.getExceptionMessageAdditionalDetail();
  throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
}
As a result, the wrong region location is never cleaned up.
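The interaction described above can be sketched in isolation. This is a simplified model, not the HBase code: the class and method names are hypothetical, and it assumes (based on the description and the quoted code comment) that the `retrying` flag is true whenever the configured retry number is not 1, and that a retry only happens when the projected duration fits within the operation timeout.

```java
// Hypothetical, simplified model of the failure path described in this issue.
// None of these names exist in HBase; they only mirror the quoted logic.
public class StaleLocationSketch {

    /** What happened after one attempt failed with NotServingRegionException. */
    static final class Outcome {
        final boolean cachePurged; // was the cached region location deleted?
        final boolean retried;     // did the caller get to retry?

        Outcome(boolean cachePurged, boolean retried) {
            this.cachePurged = cachePurged;
            this.retried = retried;
        }
    }

    static Outcome afterNotServingRegion(int retriesNumber,
                                         long projectedDurationMs,
                                         long callTimeoutMs) {
        // Assumption: the caller passes "retries != 1" as the retrying flag.
        boolean retrying = retriesNumber != 1;
        // Mirrors: t instanceof NotServingRegionException && !retrying
        boolean cachePurged = !retrying;
        // Mirrors: if (duration > callTimeout) throw SocketTimeoutException
        boolean retried = retrying && projectedDurationMs <= callTimeoutMs;
        return new Outcome(cachePurged, retried);
    }

    public static void main(String[] args) {
        // retries=3, operation timeout 1200 ms, illustrative projected duration 2200 ms.
        Outcome o = afterNotServingRegion(3, 2200, 1200);
        System.out.println("cachePurged=" + o.cachePurged + ", retried=" + o.retried);
    }
}
```

With these inputs neither branch helps: the purge is skipped because a retry is expected, and the retry is skipped because the time budget is already exhausted, so the stale entry stays in the cache.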
lhofhansl
In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers this bug.
private long singleCallDuration(final long expectedSleep) {
  return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
      + MIN_RPC_TIMEOUT + expectedSleep;
}
The same risk exists in the master branch code as well.
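The arithmetic behind the 0.94 case can be checked with a small standalone sketch. The class below is hypothetical and only reproduces the formula quoted above. It assumes the best case of zero elapsed time, and it assumes the expected sleep equals the configured pause of 200 ms with no backoff multiplier. The real value depends on the backoff table, but any non-negative sleep gives the same verdict here.

```java
// Hypothetical standalone check of the duration formula quoted in this issue.
public class CallDurationSketch {

    // Same shape as the quoted singleCallDuration, with the elapsed time
    // and MIN_RPC_TIMEOUT passed in explicitly.
    static long singleCallDuration(long elapsedMs, long minRpcTimeoutMs, long expectedSleepMs) {
        return elapsedMs + minRpcTimeoutMs + expectedSleepMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1200;   // hbase.client.operation.timeout
        long minRpcTimeout = 2000; // MIN_RPC_TIMEOUT default reported for 0.94
        long expectedSleep = 200;  // assumption: hbase.client.pause, no backoff multiplier
        long elapsed = 0;          // assumption: best case, first attempt failed instantly

        long duration = singleCallDuration(elapsed, minRpcTimeout, expectedSleep);
        // prints "2200 > 1200 : true"
        System.out.println(duration + " > " + callTimeout + " : " + (duration > callTimeout));
    }
}
```

Because MIN_RPC_TIMEOUT alone (2000) already exceeds the operation timeout (1200), the check fails on the very first attempt regardless of the sleep value, so no retry can ever refresh the location under this configuration.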
Attachments
Issue Links
- is related to
-
HBASE-15354 Use same criteria for clearing meta cache for all operations
- Resolved