HBase / HBASE-26963

ReplicationSource#removePeer hangs if we try to remove bad peer.

Details

    • Reviewed

    Description

      ReplicationSource#removePeer hangs if we try to remove a bad peer.

      Steps to reproduce:
      1. Set the config replication.source.regionserver.abort to false so that the regionserver is not aborted.
      2. Add a dummy peer.
      3. Remove that peer.

      The removePeer call hangs indefinitely, until the test times out.
      I have attached a patch that reproduces this behavior.

      I can see the following threads in the thread dump:

      "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1" #339 daemon prio=5 os_prio=31 tid=0x00007f8caa44a800 nid=0x22107 waiting on condition [0x00007000107e5000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown Source)
              at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
      
      "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 os_prio=31 tid=0x00007f8ca82fa800 nid=0x22307 in Object.wait() [0x00007000106e2000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Thread.join(Thread.java:1260)
              - locked <0x0000000799975ea0> (a java.lang.Thread)
              at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
              at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
              at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
              at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
              at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
              at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "Listener at localhost/55013" #20 daemon prio=5 os_prio=31 tid=0x00007f8caf95a000 nid=0x6703 waiting on condition [0x0000700002544000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
              at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
              at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
              at org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
              at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
              at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)
      

      The main test thread ("Listener at localhost/55013", running TestBadReplicationPeer.testWrongReplicationEndpoint) is waiting in Admin#removeReplicationPeer.

      The refreshPeer thread (#338, in PeerProcedureHandlerImpl#removePeer), which is responsible for terminating the peer, is waiting for the ReplicationSource thread to terminate.

      The ReplicationSource thread (#339) is sleeping. Notice that its stack trace is inside the uncaughtException handler of ReplicationSource.
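The relationship between threads #338 and #339 can be illustrated with plain JDK threads. The following is a minimal sketch, not HBase code: the class and method names are hypothetical, and a latch stands in for the sleep seen in #339. It relies only on the fact, visible in the trace above (Thread.dispatchUncaughtException), that an uncaught exception handler runs on the failing thread itself, so a join on that thread cannot return until the handler does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class HandlerBlocksJoin {

  /**
   * Starts a worker that throws, with a handler that blocks until released.
   * Returns {aliveWhileHandlerBlocked, aliveAfterHandlerReleased}.
   */
  static boolean[] joinWhileHandlerBlocks(long joinMillis) {
    CountDownLatch release = new CountDownLatch(1);

    Thread worker = new Thread(() -> {
      throw new IllegalStateException("simulated failure");
    }, "worker-sketch");
    worker.setDaemon(true);

    // The handler runs on the worker thread, so the worker stays alive
    // for as long as the handler keeps waiting.
    worker.setUncaughtExceptionHandler((t, e) -> {
      boolean released = false;
      while (!released) {
        try {
          released = release.await(50, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
          // Stand-in for a retry loop that keeps going after an interrupt.
        }
      }
    });

    worker.start();
    try {
      worker.join(joinMillis);              // like the join in thread #338
      boolean aliveWhileBlocked = worker.isAlive();
      release.countDown();                  // let the handler return
      worker.join(5000);
      boolean aliveAfterRelease = worker.isAlive();
      return new boolean[] {aliveWhileBlocked, aliveAfterRelease};
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(ie);
    }
  }

  public static void main(String[] args) {
    boolean[] r = joinWhileHandlerBlocks(200);
    System.out.println("alive while handler blocks: " + r[0]);
    System.out.println("alive after handler returns: " + r[1]);
  }
}
```

In this sketch the timed join gives up after 200 ms while the worker is still alive; the worker only terminates once its handler returns.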

      When we call ReplicationSourceManager#removePeer, we set the sourceRunning flag to false and interrupt the ReplicationSource thread. In this case, the ReplicationSource thread was waiting to read the cluster id of the peer when it received the InterruptedException:

      2022-04-20 08:46:49,679 WARN  [RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1] zookeeper.ZKUtil(228): connection to cluster: dummypeer_1-0x100229efa200009, quorum=127.0.0.1:55599, baseZNode=/1 Unable to set watcher on znode (/1/hbaseid)
      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Object.wait(Object.java:502)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2016)
      	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:212)
      	at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:221)
      	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
      	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
      	at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:112)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:571)
      	at java.lang.Thread.run(Thread.java:748)
      

      ZKClusterId.readClusterIdZNode catches InterruptedException and returns null.
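To show why a swallowed InterruptedException matters, here is a generic JDK-only sketch. The two read methods are hypothetical stand-ins, not the HBase implementation, and the report does not say whether HBase restores the interrupt flag. The point is only that a caller which receives null cannot tell it was interrupted unless the flag is set again.

```java
import java.util.function.Supplier;

public class SwallowedInterruptSketch {

  // Hypothetical blocking read that swallows the interrupt and returns null.
  static String readSwallowing() {
    try {
      Thread.sleep(10_000);
      return "cluster-id";
    } catch (InterruptedException e) {
      return null; // the interrupt flag was cleared when the exception was thrown
    }
  }

  // Same read, but it restores the interrupt flag before returning null.
  static String readRestoring() {
    try {
      Thread.sleep(10_000);
      return "cluster-id";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }

  /** Interrupts the current thread, runs the read, and reports whether the flag survived. */
  static boolean flagSetAfter(Supplier<String> read) {
    Thread.currentThread().interrupt(); // simulate the interrupt sent during removePeer
    read.get();                         // sleep throws immediately; both variants return null
    return Thread.interrupted();        // reads and clears the flag
  }

  public static void main(String[] args) {
    System.out.println("swallowing keeps flag: " + flagSetAfter(SwallowedInterruptSketch::readSwallowing));
    System.out.println("restoring keeps flag:  " + flagSetAfter(SwallowedInterruptSketch::readRestoring));
  }
}
```

With the swallowing variant the flag is gone after the call, so any later sleep on the same thread runs to completion instead of being interrupted.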

      ReplicationSource then sees that the sourceRunning flag is false and throws an IllegalStateException.

      Control then passes to the UncaughtExceptionHandler. Since abortOnError is set to false, the handler does not abort the regionserver; instead it goes into an infinite sleep (sleepForRetries in thread #339 above), which causes the test to hang.
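One way to picture the sequence above, and one possible way out of it, is the following JDK-only sketch. It is not the patch attached to this issue and not HBase code: the class, the method, and the checkRunningFlag switch are all hypothetical. The handler models the abortOnError=false case as a retry loop, and the optional check of a sourceRunning-style flag is an assumption about how such a loop could be made to stop once the peer has been removed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RetryHandlerSketch {

  /** Returns true if the failing worker terminated within joinMillis. */
  static boolean workerTerminates(boolean checkRunningFlag, long joinMillis) {
    // removePeer has already cleared the flag before the worker fails.
    AtomicBoolean sourceRunning = new AtomicBoolean(false);
    // Only here so the demo can stop the retry loop afterwards.
    AtomicBoolean giveUp = new AtomicBoolean(false);

    Thread worker = new Thread(() -> {
      throw new IllegalStateException("source should still be running");
    }, "replicationSource-sketch");
    worker.setDaemon(true);

    worker.setUncaughtExceptionHandler((t, e) -> {
      // Models abortOnError == false: retry with sleeps instead of aborting.
      while (!giveUp.get()) {
        if (checkRunningFlag && !sourceRunning.get()) {
          return; // assumed exit condition: the source was asked to stop
        }
        try {
          Thread.sleep(20);
        } catch (InterruptedException ie) {
          // The retry loop ignores interrupts and keeps going.
        }
      }
    });

    worker.start();
    try {
      worker.join(joinMillis);
      return !worker.isAlive();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      giveUp.set(true); // stop the demo thread in either case
    }
  }

  public static void main(String[] args) {
    System.out.println("without flag check, terminated: " + workerTerminates(false, 300));
    System.out.println("with flag check, terminated:    " + workerTerminates(true, 2000));
  }
}
```

Without the flag check the join times out, which matches the hang described here; with the check the worker exits as soon as the handler sees that the source is no longer running.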

      Attachments

        1. HBASE-26963.patch
          3 kB
          Rushabh Shah

          People

            shahrs87 Rushabh Shah