HBASE-13172

TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.0.1, 1.1.0, 0.98.12
    • Component/s: test
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The direct cause is that we get stuck in ServerManager.isServerReachable.

      https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/

      2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
      2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
      

      The interval between the first and last retry log entries is about 1 minute (roughly 51 seconds), and we only wait 1 minute, so the test times out.
      I still do not know why this happens.
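
      As a rough check of the "about 1 minute" figure, the two timestamps from the log lines above can be subtracted directly. This is only an illustrative sketch: the class and method names are hypothetical, and the timestamp pattern is assumed from the log format shown in this report.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for measuring the gap between two log timestamps. */
final class RetryIntervalSketch {

  // Assumed pattern, matching timestamps such as "2015-03-06 04:06:19,430".
  private static final DateTimeFormatter LOG_TS =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

  /** Returns the elapsed milliseconds between two log timestamps. */
  static long millisBetween(String first, String last) {
    LocalDateTime start = LocalDateTime.parse(first, LOG_TS);
    LocalDateTime end = LocalDateTime.parse(last, LOG_TS);
    return Duration.between(start, end).toMillis();
  }

  public static void main(String[] args) {
    // Timestamps of the try=0 and first try=9 log lines quoted in this report.
    long elapsed = millisBetween("2015-03-06 04:06:19,430", "2015-03-06 04:07:10,545");
    System.out.println(elapsed + " ms"); // prints "51115 ms"
  }
}
```

      So the ten attempts consume about 51 of the 60 seconds we wait, which leaves very little headroom before the timeout.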

      Finally, the log contains many entries like this:

      2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
      org.apache.hadoop.hbase.ipc.StoppedRpcClientException
      	at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
      	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
      	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
      	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
      	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
      	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
      	at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
      	at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
      	at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
      	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
      	at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:744)
      

      I think the problem is here:

      ServerManager.java
          while (retryCounter.shouldRetry()) {
              ...
              try {
                retryCounter.sleepUntilNextRetry();
              } catch(InterruptedException ie) {
                Thread.currentThread().interrupt();
              }
              ...
          }
      

      We need to break out of the while loop when we get an InterruptedException, not just mark the current thread as interrupted.
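
      The proposed change can be sketched as follows. This is a minimal, self-contained illustration and not the actual ServerManager code or the attached patch: the class RetryLoopSketch, the method isReachable, and its parameters are hypothetical stand-ins, and Thread.sleep is used in place of retryCounter.sleepUntilNextRetry().

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the retry loop; not the real ServerManager code. */
final class RetryLoopSketch {

  /**
   * Probes up to maxAttempts times, sleeping between attempts.
   * Stops retrying as soon as the thread is interrupted.
   */
  static boolean isReachable(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (probe.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException ie) {
        // Restore the interrupt flag for callers, then leave the loop
        // instead of continuing with the remaining attempts.
        Thread.currentThread().interrupt();
        break;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Simulate an interrupt that arrives before the first sleep.
    Thread.currentThread().interrupt();
    int[] probes = {0};
    boolean reachable = isReachable(() -> { probes[0]++; return false; }, 10, 100L);
    // Only one probe runs because the loop exits on the interrupt.
    System.out.println("reachable=" + reachable + ", probes=" + probes[0]);
    Thread.interrupted(); // clear the flag before exiting
  }
}
```

      With the break in place, an interrupted caller gives up after the current attempt and the interrupt status is still visible to the code further up the stack.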

        Attachments

        1. HBASE-13172-branch-1.patch
          4 kB
          Duo Zhang

            People

            • Assignee: Duo Zhang (zhangduo)
            • Reporter: Duo Zhang (zhangduo)