  1. HBase
  2. HBASE-23895

Region stuck in RIT (Region-In-Transition) when inserting the procedure into the procedure store fails


Details

    Description

      When moving a region, the master first creates a TransitRegionStateProcedure (TRSP) and sets it on the region state node. If submitting the TRSP then fails, nothing unsets the procedure, and the region stays stuck in RIT.

      hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java

        public Future<byte[]> moveAsync(RegionPlan regionPlan) throws HBaseIOException {
          TransitRegionStateProcedure proc =
            createMoveRegionProcedure(regionPlan.getRegionInfo(), regionPlan.getDestination());
          // If submitProcedure throws here, nothing unsets the procedure that
          // createMoveRegionProcedure already attached to the region state node.
          return ProcedureSyncWait.submitProcedure(master.getMasterProcedureExecutor(), proc);
        }

        public TransitRegionStateProcedure createMoveRegionProcedure(RegionInfo regionInfo,
            ServerName targetServer) throws HBaseIOException {
          RegionStateNode regionNode = this.regionStates.getRegionStateNode(regionInfo);
          if (regionNode == null) {
            throw new UnknownRegionException("No RegionStateNode found for " +
                regionInfo.getEncodedName() + "(Closed/Deleted?)");
          }
          TransitRegionStateProcedure proc;
          regionNode.lock();
          try {
            preTransitCheck(regionNode, STATES_EXPECTED_ON_UNASSIGN_OR_MOVE);
            regionNode.checkOnline();
            proc = TransitRegionStateProcedure.move(getProcedureEnvironment(), regionInfo, targetServer);
            // The procedure is attached (and the region enters ritMap) before it is submitted.
            regionNode.setProcedure(proc);
          } finally {
            regionNode.unlock();
          }
          return proc;
        }
      

      hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStateNode.java

        public void setProcedure(TransitRegionStateProcedure proc) {
          assert this.procedure == null;
          this.procedure = proc;
          ritMap.put(regionInfo, this);
        }
      
        public void unsetProcedure(TransitRegionStateProcedure proc) {
          assert this.procedure == proc;
          this.procedure = null;
          ritMap.remove(regionInfo, this);
        } 

      The master log shows the submit failing inside RegionProcedureStore.insert while the balancer was calling moveAsync:
      
      2020-02-26,13:45:21,344 ERROR [RpcServer.default.RWQ.Fifo.read.handler=437,queue=5,port=21500] org.apache.hadoop.hbase.ipc.RpcServer: Unexpected throwable object
      java.io.UncheckedIOException: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out waiting for lock for row: \x00\x00\x00\x00\x00\x0B\xAB\xD2 in region 9731aea823e7f83264b14713ae486fb7
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:588)
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:545)
              at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1042)
              at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:860)
              at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait.submitProcedure(ProcedureSyncWait.java:123)
              at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:657)
              at org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:1793)
              at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1761)
              at org.apache.hadoop.hbase.master.MasterRpcServices.balance(MasterRpcServices.java:654)
              at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:374)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:135)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:352)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:332)
      Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out waiting for lock for row: \x00\x00\x00\x00\x00\x0B\xAB\xD2 in region 9731aea823e7f83264b14713ae486fb7
              at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:6158)
              at org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation.lockRowsAndBuildMiniBatch(HRegion.java:3488)
              at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4235)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4208)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4134)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4125)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4139)
              at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4511)
              at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3209)
              at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:584)
              ... 13 more
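
One possible way to avoid the stuck state is to roll back the region node when the submit fails. The sketch below is only an illustration of that idea, not the change in the attached patch and not HBase code: `Proc`, `Node`, `Store`, and `move` are hypothetical stand-ins for TransitRegionStateProcedure, RegionStateNode, the procedure store, and moveAsync, and the real fix would also have to respect the region node lock.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RitRollbackSketch {
  /** Hypothetical stand-in for TransitRegionStateProcedure. */
  static final class Proc {}

  /** Hypothetical stand-in for RegionStateNode; keeps only the RIT bookkeeping. */
  static final class Node {
    final String region;
    final Map<String, Node> ritMap;
    Proc procedure;

    Node(String region, Map<String, Node> ritMap) {
      this.region = region;
      this.ritMap = ritMap;
    }

    synchronized void setProcedure(Proc proc) {
      this.procedure = proc;
      ritMap.put(region, this);
    }

    synchronized void unsetProcedure(Proc proc) {
      // Only clear the slot if it still holds the procedure we attached.
      if (this.procedure == proc) {
        this.procedure = null;
        ritMap.remove(region, this);
      }
    }
  }

  /** Hypothetical stand-in for the procedure store; insert may throw. */
  interface Store {
    void insert(Proc proc);
  }

  /** Attach the procedure, try to persist it, and undo the attach on failure. */
  static void move(Node node, Store store) {
    Proc proc = new Proc();
    node.setProcedure(proc);
    try {
      store.insert(proc);
    } catch (RuntimeException e) {
      // Without this rollback the node keeps the procedure and stays in ritMap.
      node.unsetProcedure(proc);
      throw e;
    }
  }

  public static void main(String[] args) {
    Map<String, Node> ritMap = new ConcurrentHashMap<>();
    Node node = new Node("9731aea823e7f83264b14713ae486fb7", ritMap);
    Store failing = proc -> {
      throw new UncheckedIOException(new IOException("Timed out waiting for lock"));
    };
    try {
      move(node, failing);
    } catch (UncheckedIOException expected) {
      // The caller still sees the failure; only the RIT bookkeeping is undone.
    }
    System.out.println("in RIT after failed submit: " + ritMap.containsKey(node.region));
    // prints "in RIT after failed submit: false"
  }
}
```

The exception is rethrown so that the caller (the balancer in the trace above) still learns that the move did not start; the rollback only keeps the region from being reported as in transition with no procedure driving it.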
      

      Attachments

        1. suggestion.patch
          15 kB
          Michael Stack

          People

            zghao Guanghao Zhang