HBASE-26963: ReplicationSource#removePeer hangs if we try to remove a bad peer.


Details

    • Reviewed

    Description

      ReplicationSource#removePeer hangs if we try to remove a bad peer.

      Steps to reproduce:
      1. Set the config replication.source.regionserver.abort to false so that the regionserver does not abort.
      2. Add a dummy peer.
      3. Remove that peer.

      The removePeer call hangs indefinitely until the test times out.
      A patch that reproduces this behavior is attached.

      I can see the following threads in the stack trace:

      "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1" #339 daemon prio=5 os_prio=31 tid=0x00007f8caa44a800 nid=0x22107 waiting on condition [0x00007000107e5000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:511)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:577)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.lambda$startup$4(ReplicationSource.java:633)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$$Lambda$350/89698794.uncaughtException(Unknown Source)
              at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
      
      "RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0" #338 daemon prio=5 os_prio=31 tid=0x00007f8ca82fa800 nid=0x22307 in Object.wait() [0x00007000106e2000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Thread.join(Thread.java:1260)
              - locked <0x0000000799975ea0> (a java.lang.Thread)
              at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:106)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:674)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:657)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:652)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.terminate(ReplicationSource.java:647)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.removePeer(ReplicationSourceManager.java:330)
              at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.removePeer(PeerProcedureHandlerImpl.java:56)
              at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:61)
              at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
              at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
              at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
      "Listener at localhost/55013" #20 daemon prio=5 os_prio=31 tid=0x00007f8caf95a000 nid=0x6703 waiting on condition [0x0000700002544000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3442)
              at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3372)
              at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
              at org.apache.hadoop.hbase.client.Admin.removeReplicationPeer(Admin.java:2861)
              at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.cleanPeer(TestBadReplicationPeer.java:74)
              at org.apache.hadoop.hbase.client.replication.TestBadReplicationPeer.testWrongReplicationEndpoint(TestBadReplicationPeer.java:66)
      

      The main thread "TestBadReplicationPeer.testWrongReplicationEndpoint" is waiting for Admin#removeReplicationPeer to return.

      The refresh-peer thread (#338, in PeerProcedureHandlerImpl#removePeer), which is responsible for terminating the peer, is waiting for the ReplicationSource thread to terminate.

      The ReplicationSource thread (#339) is sleeping. Notice that its stack trace is inside the uncaughtException handler registered in ReplicationSource#startup.

      When we call ReplicationSourceManager#removePeer, we set the sourceRunning flag to false and interrupt the ReplicationSource thread. In this case the ReplicationSource thread was waiting to read the cluster id of the peer, so it received an InterruptedException.
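The shutdown handshake described above (clear a running flag, interrupt the worker, then join it) can be sketched with plain java.lang.Thread. This is only an illustration and not HBase code: the class TerminateSketch, its method names, and the 60-second sleep that stands in for the blocking cluster-id read are all assumptions made for the example. It shows the case where the handshake works, because the worker re-checks the flag after being interrupted and exits.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a replication source; not an HBase class.
final class TerminateSketch {
    private final AtomicBoolean sourceRunning = new AtomicBoolean(true);
    private final Thread worker;

    TerminateSketch() {
        worker = new Thread(() -> {
            while (sourceRunning.get()) {
                try {
                    // Stands in for a blocking remote call such as reading the peer's cluster id.
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    // Woken by terminate(); the loop condition re-checks the flag and exits.
                }
            }
        }, "sketch-source");
        worker.setDaemon(true);
        worker.start();
    }

    /** Clears the flag, interrupts the worker, waits up to joinTimeoutMs (> 0); true if the worker exited. */
    boolean terminate(long joinTimeoutMs) {
        sourceRunning.set(false);   // set the flag first so the worker sees it after waking up
        worker.interrupt();
        try {
            worker.join(joinTimeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return !worker.isAlive();
    }
}
```

With a cooperative worker like this one, `new TerminateSketch().terminate(5000)` returns true. The hang reported in this issue arises when the worker does not exit after the interrupt.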

      2022-04-20 08:46:49,679 WARN  [RS_REFRESH_PEER-regionserver/rushabh-ltmflld:0-0.replicationSource,dummypeer_1] zookeeper.ZKUtil(228): connection to cluster: dummypeer_1-0x100229efa200009, quorum=127.0.0.1:55599, baseZNode=/1 Unable to set watcher on znode (/1/hbaseid)
      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Object.wait(Object.java:502)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2016)
      	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:212)
      	at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:221)
      	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
      	at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
      	at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:112)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:571)
      	at java.lang.Thread.run(Thread.java:748)
      

      ZKClusterId.readClusterIdZNode catches the InterruptedException and returns null.

      ReplicationSource then sees that the sourceRunning flag is false and throws an IllegalStateException.

      Control then passes to the UncaughtExceptionHandler. Since abortOnError is set to false, the handler does not abort the regionserver; instead the thread sleeps indefinitely (see sleepForRetries in the stack trace of thread #339), so the thread never exits and the test hangs.
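The reason the join in thread #338 never completes is that a Java UncaughtExceptionHandler runs on the dying thread itself (Thread.dispatchUncaughtException in the trace above), so the thread stays alive for as long as the handler keeps running. Below is a minimal, self-contained sketch of that pattern. It is not the HBase implementation: HandlerHangSketch, releaseHandler, and the sleep durations are hypothetical, and the release flag exists only so the demo can finish.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; models the hang pattern only, not HBase internals.
final class HandlerHangSketch {
    /** Returns true if the worker is still alive after flag + interrupt + join(joinTimeoutMs > 0). */
    static boolean workerSurvivesTerminate(long joinTimeoutMs) {
        AtomicBoolean sourceRunning = new AtomicBoolean(true);
        AtomicBoolean releaseHandler = new AtomicBoolean(false); // demo-only escape hatch

        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stands in for the blocking cluster-id read
            } catch (InterruptedException e) {
                // Swallowed, like a helper that returns null when interrupted.
            }
            if (!sourceRunning.get()) {
                throw new IllegalStateException("source is not running");
            }
        }, "sketch-source");
        worker.setDaemon(true);
        worker.setUncaughtExceptionHandler((thread, error) -> {
            // Models abortOnError == false: do not abort, just keep sleeping.
            while (!releaseHandler.get()) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException ie) {
                    // Ignored; keep looping.
                }
            }
        });
        worker.start();

        // The terminate sequence: clear the flag, interrupt, wait for the thread.
        sourceRunning.set(false);
        worker.interrupt();
        try {
            worker.join(joinTimeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean stillAlive = worker.isAlive();
        releaseHandler.set(true); // let the daemon thread finish so the demo does not leak it
        return stillAlive;
    }
}
```

In this sketch `HandlerHangSketch.workerSurvivesTerminate(200)` returns true: the worker is still alive inside its handler after the timed join. A caller that joins in a loop until the thread dies would therefore wait forever.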

          People

            Assignee: Rushabh Shah (shahrs87)
            Reporter: Rushabh Shah (shahrs87)