  1. HBase
  2. HBASE-12534

Wrong region location cache in client after regions are moved

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • 2.0.0
    • None
    • None

    Description

      In our 0.94 HBase cluster, we found that the client cached a wrong region location and did not update it after the region was moved to another regionserver.
      The cause is a wrong client config combined with a bug in RpcRetryingCaller in the HBase client.
      The RPC configs are as follows:

      hbase.rpc.timeout=1000
      hbase.client.pause=200
      hbase.client.operation.timeout=1200
      

      But the client retry number is 3:

      hbase.client.retries.number=3
      

      Assume that a region was on regionserver A and is then moved to regionserver B. The client tries to make a call to regionserver A and gets a NotServingRegionException. Because the retry number is not 1, the cached region location is not cleared. See: RpcRetryingCaller.java#141 and RegionServerCallable.java#127

        @Override
        public void throwable(Throwable t, boolean retrying) {
          if (t instanceof SocketTimeoutException ||
            ....
          } else if (t instanceof NotServingRegionException && !retrying) {
            // Purge cache entries for this specific region from hbase:meta cache
            // since we don't call connect(true) when number of retries is 1.
            getConnection().deleteCachedRegionLocation(location);
          }
        }
      

      But the call is not retried. Instead it throws a SocketTimeoutException, because the estimated time the call will take is larger than the operation timeout. See RpcRetryingCaller.java#152

              expectedSleep = callable.sleep(pause, tries + 1);
      
              // If, after the planned sleep, there won't be enough time left, we stop now.
              long duration = singleCallDuration(expectedSleep);
              if (duration > callTimeout) {
                String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration +
                    ": " + callable.getExceptionMessageAdditionalDetail();
                throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
              }
      

      As a result, the wrong region location is never cleaned up.
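
To make the sequence concrete, here is a minimal, self-contained Java sketch of the behavior described above. It is a toy model, not HBase code: `StaleLocationSketch`, `locationCache`, `actualServer` and `call` are hypothetical names. It assumes that the `retrying` flag is simply "retries != 1", that the planned sleep equals the pause, that elapsed time is zero, and that a real retry would look the region up again. MIN_RPC_TIMEOUT uses the 0.94 default of 2000 mentioned below.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleLocationSketch {
    // 0.94 default quoted in this report; a plain constant in this toy model.
    static final long MIN_RPC_TIMEOUT = 2000;

    // Hypothetical client-side cache: region name -> regionserver name.
    static final Map<String, String> locationCache = new HashMap<>();

    // Where the region really lives after the move.
    static String actualServer = "B";

    // One client operation against the toy cluster; returns a short outcome string.
    static String call(String region, int retries, long callTimeout, long pause) {
        // Assumption: the flag passed to throwable() is true whenever retries != 1.
        boolean retrying = retries != 1;
        for (int tries = 0; ; tries++) {
            // An empty cache entry is filled with the real location.
            String server = locationCache.computeIfAbsent(region, r -> actualServer);
            if (server.equals(actualServer)) {
                return "ok";
            }
            // The stale server answers with a NotServingRegionException here.
            // The cache is purged only when the caller is not retrying.
            if (!retrying) {
                locationCache.remove(region);
            }
            if (tries >= retries - 1) {
                return "retries exhausted";
            }
            // Same shape as the duration check above, with elapsed time taken as 0.
            long duration = MIN_RPC_TIMEOUT + pause;
            if (duration > callTimeout) {
                return "timeout"; // gives up before any retry can fix the cache
            }
            // Assumption of this model: a real retry relocates the region.
            locationCache.put(region, actualServer);
        }
    }

    public static void main(String[] args) {
        locationCache.put("r1", "A"); // stale entry left over from before the move
        for (int i = 0; i < 3; i++) {
            System.out.println(call("r1", 3, 1200, 200) + " cached=" + locationCache.get("r1"));
        }
        // With a larger operation timeout the retry runs and the entry is corrected.
        System.out.println(call("r1", 3, 5000, 200) + " cached=" + locationCache.get("r1"));
    }
}
```

With the configs from this report, every operation in the sketch ends in "timeout" and the cached entry stays "A". Only the last call, with a hypothetical operation timeout of 5000, gets far enough to retry and ends with the entry corrected to "B".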

      lhofhansl
      In HBase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers this bug.

        private long singleCallDuration(final long expectedSleep) {
          return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
            + MIN_RPC_TIMEOUT + expectedSleep;
        }
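
As a rough worked example of the arithmetic, the sketch below plugs the configs from this report into the same formula. `RetryBudgetSketch` and `retryAllowed` are hypothetical names, the elapsed time is passed in as a parameter instead of being read from EnvironmentEdgeManager, and the planned sleep is assumed to equal hbase.client.pause (the real backoff may differ).

```java
public class RetryBudgetSketch {
    // 0.94 default quoted in this report.
    static final long MIN_RPC_TIMEOUT = 2000;

    // Same formula as singleCallDuration, with the elapsed time supplied by the caller.
    static long singleCallDuration(long elapsedMs, long expectedSleepMs) {
        return elapsedMs + MIN_RPC_TIMEOUT + expectedSleepMs;
    }

    // A retry goes ahead only if the estimated duration fits in the call timeout.
    static boolean retryAllowed(long elapsedMs, long expectedSleepMs, long callTimeoutMs) {
        return singleCallDuration(elapsedMs, expectedSleepMs) <= callTimeoutMs;
    }

    public static void main(String[] args) {
        long callTimeout = 1200; // hbase.client.operation.timeout
        long pause = 200;        // hbase.client.pause, assumed to be the planned sleep

        // Best case for the client: no time has elapsed yet.
        long duration = singleCallDuration(0, pause);
        System.out.println("duration=" + duration
            + " callTimeout=" + callTimeout
            + " retryAllowed=" + retryAllowed(0, pause, callTimeout));
    }
}
```

This prints `duration=2200 callTimeout=1200 retryAllowed=false`. Even with a planned sleep of 0 the estimate is at least 2000, so with an operation timeout of 1200 the check fails before the first retry.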
      

      But the code on master carries the same risk.

      Attachments

        1. HBASE-12534-0.94-v1.diff
          1 kB
          Shaohui Liu
        2. HBASE-12534-v1.diff
          7 kB
          Shaohui Liu

            People

              Assignee: Unassigned
              Reporter: Shaohui Liu (liushaohui)
              Votes: 0
              Watchers: 8
