HBase / HBASE-21623

ServerCrashProcedure can stomp on a RIT for a wrong server


    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 2.2.0
    • Fix Version/s: None
    • Component/s: amv2
    • Labels: None

      Description

      A server died while a region was being opened on it. The open eventually failed, and the RIT procedure (TransitRegionStateProcedure, pid=151104 in the log below) retried on a different server.
      However, by then the SCP for the dying server had already picked up the region from the list of regions on the old server, so it went on to overwrite the state of the RIT even though the RIT was now targeting the new server.

      2018-12-18 23:06:03,160 INFO  [PEWorker-14] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
      ...
      2018-12-18 23:06:38,208 INFO  [PEWorker-10] procedure.ServerCrashProcedure: Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, meta=false
      ...
      2018-12-18 23:06:41,953 WARN  [RSProcedureDispatcher-pool4-t115] assignment.RegionRemoteProcedureBase: The remote operation pid=151404, ppid=151104, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
      org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server oldServer,17020,1545202098577 aborting
      
      2018-12-18 23:06:42,485 INFO  [PEWorker-5] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent processing.
      2018-12-18 23:06:42,485 INFO  [PEWorker-13] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=oldServer,17020,1545202098577
      2018-12-18 23:06:42,500 INFO  [PEWorker-13] assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=null; forceNewPlan=true, retain=false
      2018-12-18 23:06:42,657 INFO  [PEWorker-2] assignment.RegionStateStore: pid=151104 updating hbase:meta row=region1, regionState=OPENING, regionLocation=newServer,17020,1545202111238
      ...
      2018-12-18 23:06:43,094 INFO  [PEWorker-4] procedure.ServerCrashProcedure: pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, meta=false found RIT  pid=151104, ppid=150875, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=newServer,17020,1545202111238, table=t1, region=region1
      2018-12-18 23:06:43,094 INFO  [PEWorker-4] assignment.RegionStateStore: pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
      

      It seems the RIT later overwrote the state again, after which the region stayed stuck in OPENING state forever. I'm not sure yet whether that is due solely to this bug or whether another bug came after it. For now, this one can be addressed.
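
The check that appears to be missing can be sketched roughly as follows. This is a minimal standalone Java model, not the attached patch and not HBase's actual classes: `RegionNode`, `handleCrash`, and the `checkLocation` flag are hypothetical names made up for illustration. It assumes SCP works from a snapshot of the regions that were on the old server, and that comparing a region's current location with the crashed server is enough to tell that its RIT has already moved on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

public class ScpRitGuardSketch {

    /** Hypothetical stand-in for the master's in-memory record of one region. */
    public static final class RegionNode {
        public final String regionName;
        public String location; // server the region is assigned to or opening on
        public String state;    // e.g. OPENING, ABNORMALLY_CLOSED

        public RegionNode(String regionName, String location, String state) {
            this.regionName = regionName;
            this.location = location;
            this.state = state;
        }
    }

    /**
     * Models the crash-handling step over a snapshot of regions that were on
     * the crashed server. With checkLocation=true, a region whose current
     * location is no longer the crashed server is left alone.
     * Returns the names of the regions that were skipped.
     */
    public static List<String> handleCrash(String crashedServer,
                                           List<RegionNode> snapshot,
                                           boolean checkLocation) {
        List<String> skipped = new ArrayList<>();
        for (RegionNode node : snapshot) {
            synchronized (node) {
                // Re-check under the lock: the RIT may have retried elsewhere
                // after the snapshot was taken.
                if (checkLocation && !Objects.equals(crashedServer, node.location)) {
                    skipped.add(node.regionName);
                    continue;
                }
                // Models the ABNORMALLY_CLOSED update seen in the log.
                node.state = "ABNORMALLY_CLOSED";
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        String oldServer = "oldServer,17020,1545202098577";
        String newServer = "newServer,17020,1545202111238";

        RegionNode region1 = new RegionNode("region1", oldServer, "OPENING");
        // SCP takes its list of regions while region1 still points at oldServer.
        List<RegionNode> snapshot = Collections.singletonList(region1);
        // The open fails and the RIT retries on newServer before SCP gets to it.
        region1.location = newServer;

        List<String> skipped = handleCrash(oldServer, snapshot, true);
        System.out.println("skipped=" + skipped
            + " state=" + region1.state + " location=" + region1.location);
    }
}
```

With `checkLocation` set to false, the same call marks region1 as ABNORMALLY_CLOSED even though it now points at newServer, which corresponds to the 23:06:43,094 overwrite in the log above.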

        Attachments

        1. HBASE-21623.patch
          2 kB
          Sergey Shelukhin

            People

            • Assignee: Sergey Shelukhin (sershe)
            • Reporter: Sergey Shelukhin (sershe)
            • Votes: 0
            • Watchers: 4