HBASE-26955: Improvement of the pause time between retries in Rpc caller


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.2
    • Fix Version/s: None
    • Component/s: Client
    • Labels: None

    Description

      We were using HBase version 2.4.2 and had clients submitting requests to an HBase cluster. We observed an unreasonable delay of about 10 seconds in some table-create requests. We analyzed the logs and jstack output and reproduced the issue. The client jstack shows that this weird long delay comes from https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168 :

      "main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait() [0x00007f02853b1000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              - waiting on <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
              at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
              - locked <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
              at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
              at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
              at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
              at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
              at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
              at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
              at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
              at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
              at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
      

      This wait comes from the calculation within https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177 , which can be simplified as follows:

        public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
        throws IOException, RuntimeException {
          // ...
          for (int tries = 0;; tries++) {
            long expectedSleep;
            try {
              // do the call ...
            } catch (PreemptiveFastFailException e) {
              throw e;
            } catch (Throwable t) {
              ExceptionUtil.rethrowIfInterrupt(t);
              Throwable cause = t.getCause();
              if (cause instanceof DoNotRetryIOException) {  // line 113
                // Fail fast
                throw (DoNotRetryIOException) cause;
              }
              // translateException throws exception when should not retry: i.e. when request is bad.
              interceptor.handleFailure(context, t);
              t = translateException(t);
              if (tries > startLogErrorsCnt) {
                // log ...
              }
              // ...
              // If the server is dead, we need to wait a little before retrying, to give
              // a chance to the regions to be moved
              // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
              // special when encountering CallQueueTooBigException, see #HBASE-17114
              long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
              expectedSleep = callable.sleep(pauseBase, tries); //  line 152
              // If, after the planned sleep, there won't be enough time left, we stop now.
              long duration = singleCallDuration(expectedSleep);
              if (duration > callTimeout) {
                // throw exception ...
              }
            } finally {
              interceptor.updateFailureInfo(context);
            }
            try {
              if (expectedSleep > 0) {
                synchronized (cancelled) {
                  if (cancelled.get()) return null;
                  cancelled.wait(expectedSleep);  // line 168
                }
              }
              if (cancelled.get()) return null;
            } catch (InterruptedException e) {
              throw new InterruptedIOException("Interrupted after " + tries
                  + " tries while maxAttempts=" + maxAttempts);
            }
          }
        } 

      Within the retry loop, the `cause` variable represents the reason for the current failed request and is checked to decide what to do next. For example, the DoNotRetryIOException case is checked at line 113, and the CallQueueTooBigException case is checked at line 151. Finally we determine `expectedSleep`, the time to wait before the next retry, and enforce that wait at line 168 unless the call has been cancelled.
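
      For reference, the comment above says the pause starts from RETRY_BACKOFF[0] * pauseBase, and `callable.sleep(pauseBase, tries)` scales that base by a RETRY_BACKOFF multiplier that grows with the retry number (the same ConnectionUtils.getPauseTime logic that appears in the RemoteProcedureResultReporter excerpt further below). The following standalone sketch is our own illustration of that schedule, with the HConstants.RETRY_BACKOFF multipliers copied in, the default 100 ms hbase.client.pause assumed, and jitter omitted; it shows why the pause reaches about 10 seconds once the raw retry count reaches 7:

        // Illustration only: the shape of the client retry backoff schedule.
        // The multipliers mirror HConstants.RETRY_BACKOFF; jitter is omitted.
        public class BackoffScheduleSketch {
          private static final int[] RETRY_BACKOFF = { 1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200 };

          static long pauseTime(long pauseMs, int tries) {
            int index = Math.min(tries, RETRY_BACKOFF.length - 1);
            return pauseMs * RETRY_BACKOFF[index];
          }

          public static void main(String[] args) {
            long pauseMs = 100; // default hbase.client.pause
            for (int tries = 0; tries <= 7; tries++) {
              System.out.println("tries=" + tries + " -> pause " + pauseTime(pauseMs, tries) + " ms");
            }
            // tries=6 -> 4000 ms, tries=7 -> 10000 ms: once the counter has been driven
            // up while waiting for the master, a single later failure already costs a
            // ~10 second pause, which matches the delay we observed.
          }
        }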

      The comments seem to imply that this retry pause is meant for the case of a dead server:

              // If the server is dead, we need to wait a little before retrying, to give
              // a chance to the regions to be moved
              // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
              // special when encountering CallQueueTooBigException, see #HBASE-17114
              long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
              expectedSleep = callable.sleep(pauseBase, tries); //  line 152 

      However, the client log shows that some retries are not due to a dead server:

      2022-04-10 15:44:44,146 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
              at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
              at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
              at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
              at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
      , details=, see https://s.apache.org/timeout
      2022-04-10 15:44:48,181 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
      Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
              at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
              at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
              at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
              at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
              at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
              at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
              at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
              ... 3 more
      Caused by: java.io.IOException: IOException
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
              ... 14 more
      , details=, see https://s.apache.org/timeout
      2022-04-10 15:44:58,835 INFO  [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
      

      The client retries a few times because the server is not up yet (PleaseHoldException: Master is initializing), then hits an internal error when creating the table, and finally gets the table created successfully. We understand that the waiting time for each retry grows exponentially with the retry number, according to the retry backoff mechanism. However, since the comments before line 151 do not specify how the retry pause time should be determined, we suspect that retries after the server has successfully started should be backed off separately, without counting the retries spent waiting for server startup. To confirm this hypothesis, we looked at some RPC retry implementations in the same version (2.4.2) of HBase: https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111 and https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475 .

      For example, in RemoteProcedureResultReporter, the retry pause is handled separately depending on whether the server is still starting up:

         while (!server.isStopped()) {
            // ...
            try {
              server.reportProcedureDone(builder.build());
              builder.clear();
              tries = 0;
            } catch (IOException e) {
              boolean pause =
                e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
              long pauseTime;
              if (pause) {
                // Do backoff else we flood the Master with requests.
                pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
              } else {
                pauseTime = INIT_PAUSE_TIME_MS; // Reset.
              }
              LOG.info("Failed procedure report " + TextFormat.shortDebugString(request) + "; retry (#" +
                tries + ")" + (pause ? " after " + pauseTime + "ms delay (Master is coming online...)."
                  : " immediately."),
                e);
              Threads.sleep(pauseTime);
              tries++;
            }
          } 
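
      Note the two points of contrast with RpcRetryingCallerImpl here: the retry counter is reset to 0 as soon as a report succeeds, and only the startup-related exceptions (ServerNotRunningYetException, PleaseHoldException) trigger exponential backoff, while any other IOException just waits a flat INIT_PAUSE_TIME_MS before the next attempt.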

      Similarly, we suspect that RpcRetryingCallerImpl should handle ServerNotRunningYetException and PleaseHoldException in the same way. We followed this idea to write a fix and found that the symptom no longer appears. We tested the issue and the fix in versions 2.4.2, 2.4.11, and 3.0.0-alpha-2; we believe the issue exists in all of them and that our fix is applicable to each.
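
      Purely as an illustration of the idea (a sketch of ours, not the actual patch), the backoff for a non-startup failure could be computed from the number of failures seen after the server became available rather than from the raw retry count:

        import org.apache.hadoop.hbase.CallQueueTooBigException;
        import org.apache.hadoop.hbase.PleaseHoldException;
        import org.apache.hadoop.hbase.client.ConnectionUtils;
        import org.apache.hadoop.hbase.ipc.ServerNotRunningYetException;

        // Sketch only: keep "waiting for the server to come online" from inflating
        // the backoff applied to later, unrelated failures. Class, field, and method
        // names are ours, not from the actual patch.
        class RetryPauseSketch {
          private int startupTries = 0; // retries spent on startup-related failures

          long nextPause(Throwable t, int tries, long pause, long pauseForCQTBE) {
            boolean serverStarting =
              t instanceof ServerNotRunningYetException || t instanceof PleaseHoldException;
            if (serverStarting) {
              // Same as today: back off while the master/regionserver initializes.
              startupTries++;
              return ConnectionUtils.getPauseTime(pause, tries);
            }
            long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;
            // Count backoff only from the failures seen after startup, so the first
            // "real" failure starts again from RETRY_BACKOFF[0] * pauseBase.
            return ConnectionUtils.getPauseTime(pauseBase, Math.max(0, tries - startupTries));
          }
        }

      Whether the reset should be keyed off the exception type (as RemoteProcedureResultReporter does) or off something else is open for discussion; the sketch only captures the idea of counting the startup waits separately.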

      P.S. In the latest version (3.0.0-alpha-2), RpcRetryingCallerImpl no longer exists, so the issue and the fix apply to AsyncRpcRetryingCaller instead.

People

    • Assignee: Unassigned
    • Reporter: Haoze Wu (functioner)