[HBASE-22381] The write request won't refresh its HConnection's local meta cache once a RegionServer got stuck - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

In a production environment (provided by xinxin fan from Netease, HBase version 1.2.6), we found the following case:
1. A RegionServer got stuck.
2. All requests were write requests, and they threw an exception like this:

Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)

3. Write requests to the stuck RegionServer never cleared the client's local meta cache, so the client kept sending requests to the stuck server endlessly, which kept availability below 100% for a long time.

I checked the code and found the following in our AsyncRequestFutureImpl#receiveGlobalFailure:

  private void receiveGlobalFailure(/* ... */) {
    // ...
    updateCachedLocations(server, regionName, row,
      ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
    // ...
  }

isMetaClearingException does not take SocketTimeoutException into account, so the client keeps sending requests to the stuck server.

  public static boolean isMetaClearingException(Throwable cur) {
    cur = findException(cur);

    if (cur == null) {
      return true;
    }
    return !isSpecialException(cur) || (cur instanceof RegionMovedException)
        || cur instanceof NotServingRegionException;
  }

  public static boolean isSpecialException(Throwable cur) {
    return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
        || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
        || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
        || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
        || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
  }

One way to fix this would be to add SocketTimeoutException to isSpecialException. But I'm afraid that putting SocketTimeoutException into the isSpecialException set would increase the pressure on the meta table: there are other cases where we may encounter a SocketTimeoutException without any region moving, and if we clear the cache in those cases, more requests will be directed to the meta table.
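The concern above is a trade-off between recovering quickly from a stuck server and avoiding needless meta lookups. One way to bound the extra meta traffic, shown purely as an illustration and not something proposed in this issue, would be to drop cached locations only after several consecutive socket timeouts against the same server. The sketch below is standalone Java: TimeoutTracker, its methods, the threshold of 3, and the server name are all hypothetical and are not part of the HBase client API.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (NOT an HBase class): counts consecutive socket
 * timeouts per server and signals a cache clear only at a threshold.
 */
final class TimeoutTracker {
  private final int threshold;
  // Server name -> number of consecutive SocketTimeoutExceptions seen.
  private final ConcurrentMap<String, Integer> consecutive = new ConcurrentHashMap<>();

  TimeoutTracker(int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
  }

  /**
   * Records a failure. Returns true when the caller should drop its cached
   * locations for the server; other exception types are ignored here and
   * left to whatever classification logic already exists.
   */
  boolean onFailure(String server, Throwable t) {
    if (!(t instanceof SocketTimeoutException)) {
      return false;
    }
    int count = consecutive.merge(server, 1, Integer::sum);
    if (count >= threshold) {
      consecutive.remove(server); // start counting afresh after signalling
      return true;
    }
    return false;
  }

  /** A successful call to the server resets its timeout streak. */
  void onSuccess(String server) {
    consecutive.remove(server);
  }

  public static void main(String[] args) {
    TimeoutTracker tracker = new TimeoutTracker(3); // assumed threshold
    String server = "rs-1.example.org:60020";       // made-up server name
    for (int i = 1; i <= 3; i++) {
      boolean clear = tracker.onFailure(server, new SocketTimeoutException("timeout " + i));
      System.out.println("timeout " + i + " -> clear cache: " + clear);
    }
  }
}
```

With a threshold like this, an isolated timeout caused by a slow call would not trigger a meta lookup, while a server that times out repeatedly would eventually have its cached locations dropped. The right threshold, and whether a per-server counter is acceptable at all, would depend on the workload.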

People

Assignee: Zheng Hu

Reporter: Zheng Hu

Votes: 0

Watchers: 6

Dates

Created: 08/May/19 12:03

Updated: 08/May/19 12:11