Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
In a production environment (provided by xinxin fan from Netease, HBase version: 1.2.6), we found the following case:
1. a RegionServer got stuck;
2. all requests were write requests, and they threw an exception like this:
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
3. none of the write requests to the stuck region server cleared the client's local meta cache, so the client kept sending requests to the stuck server endlessly, which kept availability below 100% for a long time.
I checked the code and found the following in AsyncRequestFutureImpl#receiveGlobalFailure:
    private void receiveGlobalFailure(
        //....
        updateCachedLocations(server, regionName, row,
            ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
        //....
    }
isMetaClearingException does not take SocketTimeoutException into account, so the client keeps sending requests to the stuck server.
    public static boolean isMetaClearingException(Throwable cur) {
      cur = findException(cur);
      if (cur == null) {
        return true;
      }
      return !isSpecialException(cur) || (cur instanceof RegionMovedException)
          || cur instanceof NotServingRegionException;
    }

    public static boolean isSpecialException(Throwable cur) {
      return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
          || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
          || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
          || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
          || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
    }
One way to fix this would be to add SocketTimeoutException to isSpecialException. But I'm afraid that if we put SocketTimeoutException into the isSpecialException set, we will increase the pressure on the meta table: there are other cases where we may encounter a SocketTimeoutException without any region moving, and if we clear the cache in those cases, more requests will be directed to the meta table.
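
To make the trade-off concrete, here is a minimal sketch of one possible compromise: clear the cached location only after several consecutive timeouts against the same server, so that an isolated SocketTimeoutException does not send extra lookups to the meta table. This is not existing HBase code. The class TimeoutCacheGate, its methods, and the threshold value are all hypothetical names chosen for illustration, and how such a check would be wired into receiveGlobalFailure is left open.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of HBase): tracks consecutive timeouts per
 * server and reports when the count is high enough to justify clearing the
 * client's cached region locations for that server.
 */
final class TimeoutCacheGate {
    // Number of consecutive timeouts that triggers a cache clear (assumed tunable).
    private final int threshold;
    // Consecutive timeout count per server name.
    private final Map<String, Integer> consecutiveTimeouts = new HashMap<>();

    TimeoutCacheGate(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one timeout for the given server.
     * Returns true when the threshold is reached; the counter is then reset
     * so the next clear again requires `threshold` consecutive timeouts.
     */
    synchronized boolean onTimeout(String server) {
        int count = consecutiveTimeouts.merge(server, 1, Integer::sum);
        if (count >= threshold) {
            consecutiveTimeouts.remove(server);
            return true;
        }
        return false;
    }

    /** Any successful call to the server breaks the run of timeouts. */
    synchronized void onSuccess(String server) {
        consecutiveTimeouts.remove(server);
    }
}
```

With a threshold of 3, for example, a server that times out once and then answers normally never triggers a cache clear, while a stuck server triggers one on the third consecutive timeout. Whether a per-server counter like this is acceptable, and what threshold would be reasonable, would need to be decided against real workloads.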