Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.0.2
    • Fix Version/s: 2.1.1, 2.0.3
    • Component/s: amv2, Balancer

    Description

      We hit a case where a region shows state OPEN on an already dead server in the meta table (it is hard to trace how this happened), meaning the region is actually not online. The balancer then came along and scheduled a MoveRegionProcedure (MRP) for this region, which created a mess:
      The balancer 'thought' the region was on the live server that has the same address (but a different startcode). So it scheduled an MRP from this online server to another. The UnassignProcedure, however, dispatched the unassign call to the dead server according to the region state, found that server dead, and scheduled a ServerCrashProcedure (SCP) for it. But since the UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt it.
      In the end, the SCP can't finish because the UnassignProcedure holds the region's lock, and the UnassignProcedure can't finish because no one wakes it, so both are stuck.
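
      The confusion between the two servers can be illustrated with a minimal sketch. It is not HBase code: ServerId is a hypothetical stand-in for a server identity made of host, port and startcode, and the host name is a placeholder. Only the port and the two startcodes are taken from the log in this issue. The sketch shows that an address-only comparison treats the dead and the live instance as the same server, while a comparison that includes the startcode does not.

```java
import java.util.Objects;

// Hypothetical stand-in for a region server identity (not an HBase class):
// host, port, and the startcode (timestamp at which the process started).
final class ServerId {
    final String host;
    final int port;
    final long startcode;

    ServerId(String host, int port, long startcode) {
        this.host = host;
        this.port = port;
        this.startcode = startcode;
    }

    // Address-only comparison: ignores the startcode, so a restarted server
    // looks identical to its dead predecessor.
    boolean sameAddress(ServerId other) {
        return host.equals(other.host) && port == other.port;
    }

    // Full identity: two instances on the same host and port are different
    // servers if their startcodes differ.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ServerId)) return false;
        ServerId other = (ServerId) o;
        return sameAddress(other) && startcode == other.startcode;
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startcode);
    }
}
```

      With a live instance at startcode 1539153278584 and a dead one at 1539076734964, sameAddress returns true while equals returns false, which is the gap that let the move be planned against one server and dispatched to the other.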

      Here is the log. Notice that the server of the UnassignProcedure is 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but the call was dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964':

      2018-10-10 14:34:50,011 INFO  [PEWorker-4] assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
      2018-10-10 14:34:50,011 WARN  [PEWorker-4] assignment.RegionTransitionProcedure(230): Remote call failed hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; exception=NoServerDispatchException
      org.apache.hadoop.hbase.procedure2.NoServerDispatchException: hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
      
      // Then an SCP was scheduled
      2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but server not online
      2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
      2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
      
      // The SCP did not interrupt the UnassignProcedure but scheduled a new AssignProcedure for this region
      2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
      2018-10-10 14:34:50,054 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
      

      I also added a safety fence in the balancer: if such regions are found, balancing is skipped to be safe. It should do no harm.
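
      The idea behind such a fence can be sketched as follows. This is an illustration under stated assumptions, not the code from the attached patches: BalancerFence and its methods are hypothetical names, regions and servers are represented as plain strings, and server names are assumed to be full names that include the startcode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of HBase): decides whether a balancer run
// should be skipped because some region is recorded on a non-online server.
final class BalancerFence {

    // regionLocations maps a region name to the full server name
    // ("host,port,startcode") recorded for it; onlineServers holds the full
    // names of the servers that are currently online.
    static List<String> regionsOnUnknownServers(
            Map<String, String> regionLocations, Set<String> onlineServers) {
        List<String> suspicious = new ArrayList<>();
        for (Map.Entry<String, String> e : regionLocations.entrySet()) {
            // The full name is compared, so a dead instance with the same
            // host and port but an older startcode is not considered online.
            if (!onlineServers.contains(e.getValue())) {
                suspicious.add(e.getKey());
            }
        }
        Collections.sort(suspicious); // stable order for logging
        return suspicious;
    }

    // Balancing proceeds only when every region sits on an online server.
    static boolean shouldBalance(
            Map<String, String> regionLocations, Set<String> onlineServers) {
        return regionsOnUnknownServers(regionLocations, onlineServers).isEmpty();
    }
}
```

      Skipping the whole run is deliberately conservative: a region in this state needs repair first, and moving it would only start another procedure that cannot complete.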

      Attachments

        1. HBASE-21288.branch-2.0.001.patch
          3 kB
          Allan Yang
        2. HBASE-21288.branch-2.0.002.patch
          3 kB
          Allan Yang

          People

            Assignee: allan163 Allan Yang
            Reporter: allan163 Allan Yang