Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-23594

Procedure stuck due to region happen to recorded on two servers.

    XMLWordPrintableJSON

Details

    • Reviewed

    Description

      Master log:

      $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" hbase-hbaseadmin-master-100.107.176.225.log.1
      2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=34 size=1662) to run queue because: the exclusive lock is not held by anyone when adding pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
      2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
      2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; forceNewPlan=true, retain=false
      2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
      2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
      2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, location=100.107.176.215,60020,1576552834619; loc=100.107.176.215,60020,1576552834619
      2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
      2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
      2019-12-17 11:24:22,918 WARN  [PEWorker-8] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, ppid=193706, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 100.107.176.215,60020,1576552834619, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this.
      2019-12-17 11:24:22,921 DEBUG [PEWorker-8] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
      2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: Finished subprocedure pid=243482, resume processing parent pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
      2019-12-17 11:24:22,921 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=100.107.176.215,60020,1576552834619
      2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
      2019-12-17 11:24:22,922 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null; forceNewPlan=true, retain=false
      2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
      2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
      2019-12-17 11:24:22,943 INFO  [PEWorker-14] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING, location=100.107.164.90,60020,1576553001648; loc=100.107.164.90,60020,1576553001648
      2019-12-17 11:24:22,943 INFO  [PEWorker-14] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
      2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
      2019-12-17 11:24:23,000 INFO  [PEWorker-11] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPEN, openSeqNum=17, regionLocation=100.107.164.90,60020,1576553001648
      2019-12-17 11:24:23,003 DEBUG [PEWorker-11] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=336 size=1635) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
      2019-12-17 11:24:23,003 INFO  [PEWorker-11] procedure2.ProcedureExecutor: Finished subprocedure pid=243523, resume processing parent pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
      2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
      2019-12-17 11:24:23,005 INFO  [KeepAlivePEWorker-17] procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614, state=SUCCESS; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
      2019-12-17 11:24:28,902 DEBUG [PEWorker-7] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=27 size=1856) to run queue because: the exclusive lock is not held by anyone when adding pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GE
      
      
      
      
      2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor: Stored pid=243483, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
      
      
      
      
      
      2019-12-17 11:24:39,485 INFO  [PEWorker-10] procedure.MasterProcedureScheduler: Took xlock for pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
      2019-12-17 11:24:39,486 INFO  [PEWorker-10] assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; forceNewPlan=true, retain=false
      2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState: Add procedure pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
      2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
      2019-12-17 11:24:39,625 INFO  [PEWorker-16] assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, location=100.107.164.90,60020,1576553001648; loc=100.107.164.90,60020,1576553001648
      2019-12-17 11:24:39,625 INFO  [PEWorker-16] assignment.RegionStateStore: pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
      2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState: Add procedure pid=258071, ppid=243483, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
      2019-12-17 11:26:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
      2019-12-17 11:27:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
      2019-12-17 11:28:06,532 INFO  [master/100.107.176.225:60000.Chore.1] master.HMaster: Not running balancer because 1 region(s) in transition: [rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd]
      2019-12-17 11:28:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
      2019-12-17 11:29:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
      2019-12-17 11:30:23,219 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
      

      RegionServer log:

      $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" hbase-hbaseadmin-regionserver-100.107.164.90.log
      2019-12-17 11:24:22,972 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Open table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
      2019-12-17 11:24:22,972 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: filter rows disabled with empty filtering pattern for region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
      2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] hfile.CacheConfig: Created cacheConfig: cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=true, cacheDataCompressed=false, prefetchOnOpen=false for family {NAME => 'cf', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'DIFF', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'} with blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
      2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7, 25); ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296; major period 0, major jitter 0.500000, min locality to compact 0.000000; tiered compaction: max_age 9223372036854775807, incoming window min 6, compaction policy for tiered window org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy, single output for minor true, compaction window factory org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
      2019-12-17 11:24:22,984 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] regionserver.HStore: Store=cf,  memstore type=DefaultMemStore, storagePolicy=HOT, verifyBulkLoads=false, parallelPutCountPrintThreshold=50, encoding=DIFF, compression=SNAPPY
      2019-12-17 11:24:22,987 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
      2019-12-17 11:24:22,987 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegionServer: Post open deploy tasks for table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., openProcId=243523, masterSystemTime=1576553062965
      2019-12-17 11:24:22,990 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Opened table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
      2019-12-17 11:24:39,657 WARN  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Received OPEN for the region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., which is already online
      

      2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor: Stored pid=187614, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true

      Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region cf9a4ec6cd890aa6806fb313d71e2ebd,
      region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and assign to 100.107.164.90,60020,1576553001648 and open on 100.107.164.90,60020,1576553001648 success.
      Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try assign this region, and assign the region to 100.107.164.90,60020,1576553001648 also, but 100.107.164.90,60020,1576553001648 find it has open this region already, than the assign stuck.

      Attachments

        1. HBASE-23594-UT.patch
          6 kB
          Duo Zhang

        Issue Links

          Activity

            People

              zhangduo Duo Zhang
              binlijin Lijin Bin
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: