HBASE-24564

Make RS abort call idempotent

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha-1, 2.3.0, 1.7.0
    • Fix Version/s: 3.0.0-alpha-1, 1.7.0, 2.4.0
    • Component/s: regionserver
    • Labels: None

      Description

      We noticed this in our deployment based on branch-1, but it affects other branches too.

      1. abort() is not idempotent. Multiple aborts can run concurrently and unnecessarily complicate the state machine. The timeline of events was as follows.

      • HMaster detected that the RS had lost its ZK session and started the SCP. The session loss was caused by ZK flakiness.
      2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
      2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711
      
      • RS wakes up and attempts to report to master and receives a YouAreDead... This triggers an abort
      Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
              at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
              at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
              at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
              at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)
      
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
              at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
              at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
              at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
              at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
              at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
              ... 1 more
      
      • After a few seconds, RS also realizes that it lost the ZK session and initiates a second abort
      2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
              at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
              at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
              at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
      

      Overall, there were two sequences of aborts running at the same time. This can be avoided by making abort idempotent.
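
      One common way to get this behavior is to guard the abort path with an atomic flag, so that only the first caller runs the shutdown sequence and later callers return immediately. The following is a minimal sketch of that idea, not the actual HBase patch: the class IdempotentAbortSketch, its fields, and the counter are hypothetical names introduced for illustration only.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a region server; this is NOT the HBase HRegionServer class. */
public class IdempotentAbortSketch {
  // Flips from false to true exactly once, regardless of how many threads call abort().
  private final AtomicBoolean abortRequested = new AtomicBoolean(false);
  // Counts how often the abort sequence actually ran (illustration only).
  private final AtomicInteger abortSequencesRun = new AtomicInteger(0);

  public void abort(String reason, Throwable cause) {
    // compareAndSet succeeds only for the first caller; every later caller bails out here.
    if (!abortRequested.compareAndSet(false, true)) {
      System.out.println("Abort already in progress; ignoring: " + reason);
      return;
    }
    System.out.println("ABORTING: " + reason + (cause != null ? " (" + cause + ")" : ""));
    abortSequencesRun.incrementAndGet();
    // The real shutdown work (closing regions, stopping services, ...) would go here.
  }

  public boolean isAborted() {
    return abortRequested.get();
  }

  public int abortSequencesRun() {
    return abortSequencesRun.get();
  }

  public static void main(String[] args) throws InterruptedException {
    IdempotentAbortSketch rs = new IdempotentAbortSketch();
    // Two independent triggers, mirroring the timeline above.
    Thread reportThread = new Thread(() -> rs.abort("report rejected by master", null));
    Thread zkThread = new Thread(() -> rs.abort("ZK session expired", null));
    reportThread.start();
    zkThread.start();
    reportThread.join();
    zkThread.join();
    System.out.println("abort sequences run: " + rs.abortSequencesRun());
  }
}
```

      With this guard, whichever trigger arrives first (the rejected report or the ZK session expiry) performs the abort, and the other only logs that an abort is already under way.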

      2. Abort timeout task doesn't init as expected. (edited, see comments)

            People

            • Assignee: Bharath Vissapragada (bharathv)
            • Reporter: Bharath Vissapragada (bharathv)