Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-21120

MoveRegionProcedure makes no progress; goes to STUCK

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.1.1, 2.0.2
    • amv2
    • None

    Description

      This is tip of branch-2.0. I'm running cluster tests. I've seen this twice now where a slowDeterministic schedules a move procedure just after startup and we go dead-in-the-water never going beyond schedule of the unassign.We even schedule subsequent moves but they go no where either.

      ...
      2018-08-26 10:19:30,352 INFO  [PEWorker-14] procedure2.ProcedureExecutor: Finished pid=286, ppid=260, state=SUCCESS; MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0530.halxg.cloudera.com,16020,1535303391139 in 10.8100sec
      
      ...
      // MASTER KILLED
      
      ....
      2018-08-26 10:21:10,585 INFO  [Thread-19] assignment.RegionStateStore: Load hbase:meta entry region=fb79d389cf324aaf3ad4b686d3ad21d5, regionState=OPEN, lastHost=ve0530.halxg.cloudera.com,16020,1535303391139, regionLocation=ve0530.halxg.cloudera.com,16020,1535303391139, openSeqNum=42095
      ...
      2018-08-26 10:26:15,531 INFO  [RpcServer.default.FPBQ.Fifo.handler=46,queue=1,port=16000] master.HMaster: Client=stack//10.17.240.20 move hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180, running balancer
      2018-08-26 10:26:16,635 INFO  [PEWorker-13] procedure.MasterProcedureScheduler: pid=413, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180 checking lock on fb79d389cf324aaf3ad4b686d3ad21d5
      2018-08-26 10:26:16,896 INFO  [PEWorker-14] procedure2.ProcedureExecutor: Finished pid=405, state=SUCCESS; ServerCrashProcedure server=ve0540.halxg.cloudera.com,16020,1535303391120, splitWal=true, meta=true in 25.8450sec
      2018-08-26 10:26:16,896 INFO  [PEWorker-13] procedure.MasterProcedureScheduler: pid=414, ppid=413, state=RUNNABLE:REGION_TRANSITION_QUEUE; UnassignProcedure table=IntegrationTestBigLinkedList, region=fb79d389cf324aaf3ad4b686d3ad21d5, server=ve0530.halxg.cloudera.com,16020,1535303391139 checking lock on fb79d389cf324aaf3ad4b686d3ad21d5
      2018-08-26 10:26:29,308 INFO  [RpcServer.default.FPBQ.Fifo.handler=0,queue=0,port=16000] master.ServerManager: Registering regionserver=ve0540.halxg.cloudera.com,16020,1535304386779                                                                                  2018-08-26 10:26:29,342 INFO  [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node created, adding [ve0540.halxg.cloudera.com,16020,1535304386779]                                                                                          2018-08-26 10:27:03,820 INFO  [ReadOnlyZKClient-ve0524.halxg.cloudera.com:2222@0x1da09819] zookeeper.ZooKeeper: Session: 0x16577368c840054 closed
      2018-08-26 10:27:03,820 INFO  [ReadOnlyZKClient-ve0524.halxg.cloudera.com:2222@0x1da09819-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x16577368c840054                                                                                      2018-08-26 10:27:16,411 INFO  [RpcServer.default.FPBQ.Fifo.handler=0,queue=0,port=16000] master.HMaster: Client=stack//10.17.240.20 move hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180, running balancer                                                                                                                                                                                                                              2018-08-26 10:27:17,001 INFO  [PEWorker-13] procedure.MasterProcedureScheduler: pid=415, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg
      .cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180 checking lock on fb79d389cf324aaf3ad4b686d3ad21d5
      ...
      

      Looking in UI, it says the below which has NO mention of pid=414 ... like it disappeared!!!

      413		WAITING	stack	MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180	Sun Aug 26 10:26:15 PDT 2018	Sun Aug 26 10:26:16 PDT 2018		[ { state => [ '1', '2' ] }, { regionId => '1535303851257', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { hostName => 've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' }, destinationServer => { hostName => 've0534.halxg.cloudera.com', port => '16020', startCode => '1535303812180' } } ]
      415		RUNNABLE	stack	MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180	Sun Aug 26 10:27:16 PDT 2018	Sun Aug 26 10:27:16 PDT 2018		[ {}, { regionId => '1535303851257', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { hostName => 've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' }, destinationServer => { hostName => 've0534.halxg.cloudera.com', port => '16020', startCode => '1535303812180' } } ]
      

      Looking at locks in UI, it says that pid=413 has the exclusive lock on this region...

      REGION: fb79d389cf324aaf3ad4b686d3ad21d5
      Lock type: EXCLUSIVE
      
      Owner procedure: { ID => '413', PARENT_ID => '-1', STATE => 'WAITING', OWNER => 'stack', TYPE => 'MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg.cloudera.com,16020,1535303391139, destination=ve0534.halxg.cloudera.com,16020,1535303812180', START_TIME => 'Sun Aug 26 10:26:15 PDT 2018', LAST_UPDATE => 'Sun Aug 26 10:26:16 PDT 2018', PARAMETERS => [ { state => [ '1', '2' ] }, { regionId => '1535303851257', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { hostName => 've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' }, destinationServer => { hostName => 've0534.halxg.cloudera.com', port => '16020', startCode => '1535303812180' } } ] }
      

      Is it that the child is not picking up the lock? Let me look at the WALs to see what story with pid=414 is....

      Attachments

        1. HBASE-21120.apply-in-reverse.txt
          3 kB
          Michael Stack

        Activity

          People

            stack Michael Stack
            stack Michael Stack
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: