HBase / HBASE-20680

Master hung during initialization waiting for hbase:meta to be assigned, which never happens


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When running IntegrationTestRSGroups, the test hung waiting for the master to finish initializing.

      The HBase cluster was launched without RSGroup configuration. The test script then adds the required RSGroup settings to hbase-site.xml and restarts the cluster.

      It seems that, while the master was trying to assign hbase:meta, the destination regionserver was in the middle of going down. This left HBase in a state where it starts the regionserver recovery procedures but never actually gets hbase:meta assigned.

      2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740}]
      
      2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] wal.WALProcedureStore: hsync completed for hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
      
      2018-06-01 10:47:50,026 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
      
      2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 for rit=OFFLINE, location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, table=hbase:meta, region=1588230740 last loc=null
      
      2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; forceNewPlan=false, retain=true target svr=null
      

      At 10:48:04 on Fri Jun 1, the master was restarted.

      The new master picked up pid=41:

      2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
      

      There were no further log entries for pid=41 after this.
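
One way to confirm that a procedure has gone silent is to scan the master log for the last line that mentions each pid. The following is only a minimal sketch: the function name `last_seen_by_pid` is hypothetical, and the timestamp and `pid=` patterns are assumptions based solely on the log lines quoted in this report.

```python
import re
from typing import Dict, Iterable

# Leading timestamp of a log line, e.g. "2018-06-01 10:48:47,971".
TIMESTAMP = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
# Matches "pid=41" but not "ppid=41" (the lookbehind rejects a preceding letter).
PID = re.compile(r"(?<![A-Za-z])pid=(\d+)")


def last_seen_by_pid(lines: Iterable[str]) -> Dict[int, str]:
    """Map each procedure id to the latest timestamp of a line that mentions it."""
    last_seen: Dict[int, str] = {}
    for line in lines:
        ts = TIMESTAMP.match(line)
        if ts is None:
            continue  # continuation line or stack trace without a timestamp
        stamp = ts.group(1)
        for pid in PID.findall(line):
            key = int(pid)
            # Timestamps in this format sort correctly as plain strings.
            last_seen[key] = max(last_seen.get(key, ""), stamp)
    return last_seen
```

Lines that reference a procedure only as `ppid=` or `exclusiveLockOwner=` are deliberately not counted, so a pid that is merely blocking others still shows its own last activity.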

      Later, when the master initiated another meta recovery procedure (pid=42), its child AssignProcedure (pid=43) appears to have been locked out by pid=41, which still holds the exclusive lock on the region:

      2018-06-01 10:49:34,292 INFO  [PEWorker-2] procedure.MasterProcedureScheduler: pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 checking lock on 1588230740
      
      2018-06-01 10:49:34,293 DEBUG [PEWorker-2] assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, namespaceLocks={}, tableLocks={}, regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
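
To find which procedure is blocking a waiter, the `regionLocks` section of a LOCK_EVENT_WAIT line can be parsed mechanically. This is a rough sketch rather than an HBase utility: `RegionLock` and `parse_region_locks` are hypothetical names, and the pattern assumes the exact field layout seen in the single log line above, which may differ in other HBase versions.

```python
import re
from typing import List, NamedTuple


class RegionLock(NamedTuple):
    region: str           # encoded region name, e.g. "1588230740" for hbase:meta
    exclusive_owner: int  # pid of the procedure holding the exclusive lock
    shared_count: int
    waiting: int          # number of procedures queued behind the lock


# One entry inside regionLocks={{...}}, following the format in the log above.
LOCK_ENTRY = re.compile(
    r"(\w+)=exclusiveLockOwner=(\d+),\s*sharedLockCount=(\d+),\s*waitingProcCount=(\d+)"
)


def parse_region_locks(line: str) -> List[RegionLock]:
    """Extract every region lock entry found in a LOCK_EVENT_WAIT log line."""
    return [
        RegionLock(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in LOCK_ENTRY.finditer(line)
    ]
```

Applied to the line above, this reports region 1588230740 as exclusively owned by pid 41 with one waiting procedure, which matches the observation that pid=43 is queued behind pid=41.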
      

        Attachments

        1. 20680-logs.tar.gz
          8.24 MB
          Ted Yu


            People

            • Assignee: Unassigned
            • Reporter: Josh Elser (elserj)