[HBASE-26963] ReplicationSource#removePeer hangs if we try to remove bad peer. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.5.0, 3.0.0-alpha-2, 2.4.11
Fix Version/s: 2.5.0, 3.0.0-alpha-3, 2.4.13
Component/s: regionserver, Replication
Labels: None
Hadoop Flags: Reviewed

Description

ReplicationSource#removePeer hangs if we try to remove bad peer.

Steps to reproduce:
1. Set the config replication.source.regionserver.abort to false so that the regionserver is not aborted.
2. Add a dummy peer.
3. Remove that peer.

The removePeer call hangs until the test times out.
A patch that reproduces this behavior is attached.

I can see the following threads in the stack trace:

"RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1" #339 daemon prio=5 os_prio=31 tid=0x00007f8caa44a800 nid=0x22107 waiting on condition [0x00007000107e5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown Source)
        at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)

"RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 os_prio=31 tid=0x00007f8ca82fa800 nid=0x22307 in Object.wait() [0x00007000106e2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1260)
        - locked <0x0000000799975ea0> (a java.lang.Thread)
        at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
        at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
        at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
        at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
        at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

"Listener at localhost/55013" #20 daemon prio=5 os_prio=31 tid=0x00007f8caf95a000 nid=0x6703 waiting on condition [0x0000700002544000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
        at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
        at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
        at org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
        at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
        at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)

The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is waiting for Admin#removeReplicationPeer.

The refreshPeer thread (#338, in PeerProcedureHandlerImpl#removePeer), which is responsible for terminating the peer, is waiting for the ReplicationSource thread to terminate.

The ReplicationSource thread (#339) is sleeping. Notice that this thread's stack trace is inside the ReplicationSource#uncaughtException method.

When we call ReplicationSourceManager#removePeer, we set the sourceRunning flag to false and send an interrupt to the ReplicationSource thread. In this case the ReplicationSource thread was waiting to read the cluster id of the peer, and it received an InterruptedException.

2022-04-20 08:46:49,679 WARN  [RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1] zookeeper.ZKUtil(228): connection to cluster: dummypeer_1-0x100229efa200009, quorum=127.0.0.1:55599, baseZNode=/1 Unable to set watcher on znode (/1/hbaseid)
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2016)
	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:212)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:221)
	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
	at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:112)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:571)
	at java.lang.Thread.run(Thread.java:748)

ZKClusterId.readClusterIdZNode catches InterruptedException and returns null.
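To illustrate the general pattern, here is a minimal, self-contained sketch of a blocking lookup that catches InterruptedException and returns null. It is not HBase code: the class SwallowedInterruptSketch and its methods are hypothetical names, and Thread.sleep stands in for the blocking ZooKeeper call. The sketch only shows that, once the exception is caught this way, the caller sees a null result and the thread's interrupt status is no longer set.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class SwallowedInterruptSketch {

  /** Hypothetical stand-in for a blocking lookup that swallows the interrupt. */
  static String readIdSwallowingInterrupt() {
    try {
      Thread.sleep(60_000); // blocks until the thread is interrupted
      return "some-id";
    } catch (InterruptedException e) {
      // The interrupt is consumed here; the caller only sees a null result.
      return null;
    }
  }

  /** Returns { lookupReturnedNull, interruptFlagStillSet } as seen by the worker. */
  static boolean[] run() throws InterruptedException {
    CountDownLatch started = new CountDownLatch(1);
    AtomicBoolean gotNull = new AtomicBoolean();
    AtomicBoolean stillInterrupted = new AtomicBoolean();

    Thread worker = new Thread(() -> {
      started.countDown();
      gotNull.set(readIdSwallowingInterrupt() == null);
      stillInterrupted.set(Thread.currentThread().isInterrupted());
    });
    worker.setDaemon(true);
    worker.start();

    started.await();    // make sure the worker is running before interrupting it
    worker.interrupt();
    worker.join();
    return new boolean[] { gotNull.get(), stillInterrupted.get() };
  }

  public static void main(String[] args) throws InterruptedException {
    boolean[] r = run();
    System.out.println("lookup returned null: " + r[0]);
    System.out.println("interrupt flag still set: " + r[1]);
  }
}
```

In this sketch the worker observes a null result with its interrupt flag cleared, so any later code on that thread has to rely on some other signal (in the report above, the sourceRunning flag) to learn that it was asked to stop.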

ReplicationSource then sees that the sourceRunning flag is set to false and throws an IllegalStateException.

Control then passes to the UncaughtExceptionHandler, and since abortOnError is set to false, it goes into an infinite sleep, causing the test to hang.
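The hang can be modeled outside HBase with a small sketch. The class HandlerSleepHangSketch, its method, and the abortOnError parameter below are hypothetical stand-ins, not the actual ReplicationSource implementation. The sketch relies on one fact that is also visible in the thread dump above: an uncaught exception handler runs on the failing thread itself, so the thread stays alive for as long as the handler keeps sleeping, and a join on it cannot complete.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class HandlerSleepHangSketch {

  /**
   * Starts a worker that fails immediately, then tries to stop it the way a
   * terminate call might: interrupt, then a bounded join.
   * Returns true if the worker is still alive after the join.
   */
  static boolean workerAliveAfterJoin(boolean abortOnError, long joinMillis)
      throws InterruptedException {
    CountDownLatch inHandler = new CountDownLatch(1);

    Thread worker = new Thread(() -> {
      // Stand-in for the failure raised once the source is no longer running.
      throw new IllegalStateException("source is not running");
    }, "source-sketch");
    worker.setDaemon(true); // lets the JVM exit even though the worker never ends

    worker.setUncaughtExceptionHandler((t, e) -> {
      inHandler.countDown();
      if (abortOnError) {
        return; // assumption: aborting ends the handler, so the thread can exit
      }
      while (true) { // assumption: the non-abort path sleeps and retries forever
        try {
          Thread.sleep(100);
        } catch (InterruptedException ie) {
          // Interrupt ignored in this sketch; the loop keeps sleeping.
        }
      }
    });

    worker.start();
    inHandler.await(5, TimeUnit.SECONDS); // wait until the handler is running
    worker.interrupt();
    worker.join(joinMillis);
    return worker.isAlive();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("alive, abortOnError=false: " + workerAliveAfterJoin(false, 300));
    System.out.println("alive, abortOnError=true: " + workerAliveAfterJoin(true, 300));
  }
}
```

With abortOnError set to false the worker is still alive after the bounded join, which mirrors thread #338 waiting in Thread.join for thread #339. One possible way out, under the same assumptions, is for the handler to stop retrying once the source has been asked to terminate.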

Attachments

HBASE-26963.patch
20/Apr/22 16:56
3 kB
Rushabh Shah

Issue Links

links to

GitHub Pull Request #4361

GitHub Pull Request #4413
