Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-23079

RegionRemoteProcedureBase should override setTimeoutFailure

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 3.0.0-alpha-1, 2.3.0, 2.2.2
    • amv2
    • None
    • Reviewed

    Description

      In our test environment, we found that, when OpenRegionProcedure failed by timeout, though it doesn't support rollback, ProcedureExcutor will delete the OPR node in the store in its own rollback steps and make the RIT stay there looping rollback exception. ServerCrashProcedure will detect this RIT but cannot deal with it, because no node in the store and it will encounters NPE, as a result, SCP aborts.

       

      Logs are as follows,

      2019-08-29,07:27:35,710 INFO [PEWorker-15] org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true found RIT pid=56043, ppid=55625, state=ROLLEDBACK; TransitRegionStateProcedure table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN, location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae
      2019-08-29,07:27:35,711 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
      java.lang.NullPointerException
      at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
      at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
      at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
      at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
      at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
      at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
      at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
      at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
      at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
      at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
      2019-08-29,07:27:35,714 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
      java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
      at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
      at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
      at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
      2019-08-29,07:27:35,716 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
      java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_SPLIT_LOGS
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
      at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
      at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
      at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)

       

      -------------

      [work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
      hbase.4500.log:2019-08-29,01:40:31,103 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
      hbase.4500.log:2019-08-29,01:40:31,107 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, ppid=56659, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, timestamp=1567014012995
      hbase.4500.log:2019-08-29,01:40:31,186 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, location=null, table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459 to restore RIT
      hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT rollback...pid=56661, ppid=56659, state=FAILED, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
      hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
      hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
      hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
      hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
      hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            zhangduo Duo Zhang
            Xiaolin Ha Xiaolin Ha
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment