[HBASE-24564] Make RS abort call idempotent - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-1, 2.3.0, 1.7.0
Fix Version/s: 3.0.0-alpha-1, 1.7.0, 2.4.0
Component/s: regionserver
Labels: None

Description

We noticed this in our deployment based on branch-1, but it affects other branches too.

1. abort() is not idempotent. Multiple aborts can run concurrently and needlessly complicate the state machine. The timeline of events follows.

HMaster detected that the RS had lost its ZK session and started the SCP. The session loss was caused by ZK flakiness.

2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711

The RS wakes up, attempts to report to the master, and receives a YouAreDead... This triggers the first abort.

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
        at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)

        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
        at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
        at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
        at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
        at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
        ... 1 more

A few seconds later, the RS also realizes that it has lost its ZK session and initiates a second abort.

2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
        at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)

Overall, two abort sequences were running at the same time. This can be avoided by making abort() idempotent.
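One common way to make such a call idempotent is to guard it with an atomic flag, so that only the first caller runs the abort sequence and later callers return early. The following is a minimal sketch of that idea, not the code from the linked pull requests; the class name `IdempotentAbortSketch`, its fields, and the boolean return value are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical illustration of an idempotent abort guard.
 * This is NOT HBase's HRegionServer; all names here are assumptions.
 */
class IdempotentAbortSketch {
    // Flipped exactly once, by whichever caller requests the abort first.
    private final AtomicBoolean abortRequested = new AtomicBoolean(false);
    // Counts how many times the abort sequence actually ran (for demonstration).
    private final AtomicInteger abortSequencesRun = new AtomicInteger(0);

    /**
     * Requests an abort.
     *
     * @return true if this call started the abort sequence,
     *         false if an abort was already in progress and this call was ignored
     */
    boolean abort(String reason, Throwable cause) {
        // compareAndSet succeeds for only one caller, even under concurrency.
        if (!abortRequested.compareAndSet(false, true)) {
            System.out.println("Abort already in progress, ignoring: " + reason);
            return false;
        }
        System.out.println("ABORTING: " + reason
            + (cause != null ? " (" + cause + ")" : ""));
        // Placeholder for the real abort work (closing regions, stopping services).
        abortSequencesRun.incrementAndGet();
        return true;
    }

    boolean isAborted() {
        return abortRequested.get();
    }

    int abortSequencesRun() {
        return abortSequencesRun.get();
    }
}
```

In the timeline above, the rejected server report and the ZK session expiry would both call `abort(...)`. With a guard like this, only the first call would run the abort sequence and the second would log and return.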

~~2. Abort timeout task doesn't init as expected.~~ (edited, see comments)

Issue Links

links to

GitHub Pull Request #1905

GitHub Pull Request #1910

GitHub Pull Request #1911

People

Assignee: Bharath Vissapragada

Reporter: Bharath Vissapragada

Votes: 0

Watchers: 4

Dates

Created: 15/Jun/20 19:20

Updated: 18/Jun/20 10:48

Resolved: 16/Jun/20 16:17