[HBASE-28669] After one RegionServer restarts, another RegionServer leaks a connection to ZooKeeper - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.5
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.6.1, 2.5.11
Component/s: Replication
Labels: Replication, pull-request-available
Hadoop Flags: Reviewed

Description

The peer "to_pd_A" has been removed, but the RegionServer still logs the following error:

2024-06-11 09:42:34.074 ERROR [ReplicationExecutor-0.replicationSource,to_pd_A-172.30.12.12,6002,1709612684705-SendThread(bjtx-hbase-onll-meta-01:2181)] client.StaticHostProvider: Unable to resolve address: bjtx-hbase-onll-meta-03:2181
java.net.UnknownHostException: bjtx-hbase-onll-meta-03
   at java.net.InetAddress$CachedAddresses.get(InetAddress.java:764)
   at java.net.InetAddress.getAllByName0(InetAddress.java:1291)
   at java.net.InetAddress.getAllByName(InetAddress.java:1144)
   at java.net.InetAddress.getAllByName(InetAddress.java:1065)
   at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:92)
   at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:147)
   at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:375)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1137)

Steps to reproduce:

On a cluster with 3 RegionServers, the following steps reproduce the ZooKeeper connection leak:
1. Enable replication.
2. Create a replication peer.
3. Shut down any two RegionServers, wait a few minutes, and restart them.
4. Dump the thread stacks of the RegionServer that was not shut down and search for the peer ID (<peerId>). There are 4 additional ZooKeeper client threads.
5. Remove the peer. The 4 extra threads still exist.
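Step 4 can be scripted. The following is a minimal sketch, assuming the thread dump was saved to a text file (for example with jstack) and that the leaked ZooKeeper client threads follow the naming seen in this report: the name contains `replicationSource,<peerId>-` and ends in `-EventThread` or `-SendThread(<zk address>)`. The class name `ZkThreadLeakCheck` and its methods are hypothetical helpers, not part of HBase.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class ZkThreadLeakCheck {

    // Extracts the quoted thread name that starts a thread header line in a dump.
    // Returns null for lines that are not thread headers (stack frames, blanks).
    static String threadNameOf(String dumpLine) {
        if (!dumpLine.startsWith("\"")) {
            return null;
        }
        int end = dumpLine.indexOf('"', 1);
        return end < 0 ? null : dumpLine.substring(1, end);
    }

    // True if the thread name looks like a ZooKeeper client thread created for
    // the given replication peer. The naming pattern is an assumption inferred
    // from the thread dump in this report.
    static boolean isZkThreadForPeer(String threadName, String peerId) {
        boolean forPeer = threadName.contains("replicationSource," + peerId + "-");
        boolean zkThread = threadName.endsWith("-EventThread")
            || threadName.contains("-SendThread(");
        return forPeer && zkThread;
    }

    // Counts ZooKeeper client threads for the peer in the given dump lines.
    static long countZkThreadsForPeer(List<String> dumpLines, String peerId) {
        return dumpLines.stream()
            .map(ZkThreadLeakCheck::threadNameOf)
            .filter(name -> name != null && isZkThreadForPeer(name, peerId))
            .count();
    }

    // Usage: java ZkThreadLeakCheck <thread-dump-file> <peerId>
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(countZkThreadsForPeer(lines, args[1]));
    }
}
```

If the count is still non-zero after the peer has been removed (step 5), the threads belong to ZooKeeper connections that were never closed.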

The following are the leaked threads from the thread dump of one of my RegionServers:

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-EventThread" #610 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=466.94s tid=0x00007efc58179000 nid=0x5a051 waiting on condition [0x00007efc2cdef000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.29,6002,1718180442225-SendThread(10.0.16.100:2181)" #609 daemon prio=5 os_prio=0 cpu=3.02ms elapsed=466.94s tid=0x00007efc58178800 nid=0x5a050 runnable [0x00007efc2cef0000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-EventThread" #505 daemon prio=5 os_prio=0 cpu=0.27ms elapsed=556.09s tid=0x00007efc50094800 nid=0x59c04 waiting on condition [0x00007efc2d7f7000]

"ReplicationExecutor-0.replicationSource,lizy_test_replication-10.0.16.9,6002,1718180457260-SendThread(10.0.16.100:2181)" #504 daemon prio=5 os_prio=0 cpu=3.72ms elapsed=556.09s tid=0x00007efc50093000 nid=0x59c03 runnable [0x00007efc2d8f8000]

Issue Links

links to

GitHub Pull Request #6147

GitHub Pull Request #6207

People

Assignee: ZhongYou Li

Reporter: ZhongYou Li

Votes: 0

Watchers: 7

Dates

Created: 16/Jun/24 08:07

Updated: 07/Sep/24 07:35

Resolved: 06/Sep/24 08:56