[HBASE-26088] conn.getBufferedMutator(tableName) leaks thread executors and other problems - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.5.0, 2.3.6, 2.4.5
Component/s: Client
Labels: None

Hadoop Flags: Reviewed
Release Note:

The API documentation for Connection#getBufferedMutator(TableName) and Connection#getBufferedMutator(BufferedMutatorParams) stated that when the user doesn't pass a thread pool, the Connection's thread pool is used. In reality, a new thread pool was created in such cases.

The behaviour of the code is unchanged, but the Javadoc has been corrected, and a bug that left this new pool open when the BufferedMutator was closed has been fixed.


Description

TL;DR: conn.getBufferedMutator(tableName) is dangerous in hbase client 2.4.4 (it leaks thread executors) and doesn't match its documented behavior in either 1.4.13 or 2.4.4.

To work around the problems until they are fixed, create one shared executor and pass it in explicitly:

var mySingletonPool = HTable.getDefaultExecutor(hbaseConf);
var params = new BufferedMutatorParams(tableName);
params.pool(mySingletonPool);
var myMutator = conn.getBufferedMutator(params);

And avoid code like this:

var myMutator = conn.getBufferedMutator(tableName);

The full story:

My application started leaking threads after upgrading from hbase client 1.4.13 to 2.4.4. The leak is so fast that after less than a minute of runtime more than 30k threads are leaked and all available virtual memory on the box (> 50 GB) is consumed. Other processes on the box start crashing with memory allocation errors. Even running ls at the shell fails with OS resource allocation failures.

A thread dump after just a few seconds of runtime shows thousands of threads like this:

"htable-pool-0" #8841 prio=5 os_prio=0 cpu=0.15ms elapsed=7.49s tid=0x00007efb6d2a1000 nid=0x57d2 waiting on condition [0x00007ef8a6c38000]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
 - parking to wait for <0x00000007e7cd6188> (a java.util.concurrent.SynchronousQueue$TransferStack)
 at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
 at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.6/SynchronousQueue.java:462)
 at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.6/SynchronousQueue.java:361)
 at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.6/SynchronousQueue.java:937)
 at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.6/ThreadPoolExecutor.java:1053)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1114)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
 at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

Note: All the threads are labeled htable-pool-0. That suggests we're leaking thread executors not just threads. The htable-pool part indicates the problem is to do with HTable.getDefaultExecutor(conf) and the only part of my code that interacts with that is a call to conn.getBufferedMutator(tableName).
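The reasoning in the note above can be illustrated with a minimal JDK-only sketch. It assumes a thread factory that numbers threads per pool starting at 0; the class name PoolNamingSketch, the factory, and the "demo-pool" prefix are hypothetical and are not taken from the hbase client code.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class PoolNamingSketch {
  // Hypothetical factory: every instance numbers its own threads starting at 0.
  static ThreadFactory perPoolFactory(String prefix) {
    AtomicInteger next = new AtomicInteger();
    return r -> {
      Thread t = new Thread(r, prefix + "-" + next.getAndIncrement());
      t.setDaemon(true);
      return t;
    };
  }

  // One shared pool running `tasks` tasks at the same time: names count upward.
  static Set<String> namesFromOnePool(int tasks) {
    ExecutorService pool = Executors.newCachedThreadPool(perPoolFactory("demo-pool"));
    Set<String> names = Collections.synchronizedSet(new TreeSet<>());
    CountDownLatch started = new CountDownLatch(tasks);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          names.add(Thread.currentThread().getName());
          started.countDown();
          try {
            release.await(); // hold this thread so the next task needs a new one
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      started.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
      pool.shutdown();
    }
    return new TreeSet<>(names);
  }

  // A fresh pool per task: every pool's first thread gets the same "-0" name.
  static Set<String> namesFromManyPools(int pools) {
    Callable<String> whoAmI = () -> Thread.currentThread().getName();
    Set<String> names = new TreeSet<>();
    for (int i = 0; i < pools; i++) {
      ExecutorService pool = Executors.newCachedThreadPool(perPoolFactory("demo-pool"));
      try {
        names.add(pool.submit(whoAmI).get());
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      } finally {
        pool.shutdown();
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println("one pool:   " + namesFromOnePool(3));
    System.out.println("many pools: " + namesFromManyPools(3));
  }
}
```

With a single pool the names run from demo-pool-0 to demo-pool-2, while separate pools only ever yield demo-pool-0. A dump full of identically named "-0" threads therefore points at many executors rather than one executor with many threads.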

Looking at the hbase client code shows a few problems:

1) Neither 1.4.13 nor 2.4.4's behavior matches the documentation for conn.getBufferedMutator(tableName) which says:

This BufferedMutator will use the Connection's ExecutorService.

That suggests some singleton thread executor is being used which is not the case.

2) Under 1.4.13 you get a new ThreadPoolExecutor for every BufferedMutator. That's probably not what you want but you likely won't notice. I didn't. It's a code path I hadn't profiled much.

3) Under 2.4.4 you get a new ThreadPoolExecutor for every BufferedMutator, and that ThreadPoolExecutor is not cleaned up after the mutator is closed. Each abandoned ThreadPoolExecutor keeps one idle thread around until its keep-alive timeout, which defaults to 60 seconds, expires.

My application creates one BufferedMutator for every incoming stream. There are lots of streams, and some of them are short-lived, so my code leaks threads fast under 2.4.4.
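The effect can be reproduced without hbase in a rough JDK-only sketch. The executor settings here (one core thread, a 60 second keep-alive, core threads allowed to time out, a SynchronousQueue) are assumptions chosen to resemble the thread dump above. The class LeakedExecutorSketch and its method are hypothetical stand-ins for a mutator's lifecycle, not the client's actual code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class LeakedExecutorSketch {
  // Creates one pool per simulated mutator, runs one task on it, then "closes" it.
  // Returns how many worker threads are still alive afterwards.
  static int liveThreadsAfterClose(int mutators, boolean shutdownOnClose) {
    List<Thread> created = new CopyOnWriteArrayList<>();
    ThreadFactory factory = r -> {
      Thread t = new Thread(r, "sketch-pool-0");
      t.setDaemon(true); // daemon, so lingering threads cannot block JVM exit
      created.add(t);
      return t;
    };
    try {
      for (int i = 0; i < mutators; i++) {
        // Assumed settings: 1 core thread, 60 s keep-alive, core threads may time out.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, Integer.MAX_VALUE, 60, TimeUnit.SECONDS, new SynchronousQueue<>(), factory);
        pool.allowCoreThreadTimeOut(true);
        pool.submit(() -> { }).get(); // stand-in for a single flush
        if (shutdownOnClose) {
          pool.shutdown();
          pool.awaitTermination(5, TimeUnit.SECONDS);
        }
      }
      if (shutdownOnClose) {
        for (Thread t : created) {
          t.join(1000); // give terminated workers a moment to finish exiting
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    }
    int alive = 0;
    for (Thread t : created) {
      if (t.isAlive()) {
        alive++;
      }
    }
    return alive;
  }

  public static void main(String[] args) {
    System.out.println("without shutdown: " + liveThreadsAfterClose(20, false));
    System.out.println("with shutdown:    " + liveThreadsAfterClose(20, true));
  }
}
```

Without the shutdown call every simulated mutator leaves one parked worker behind until the keep-alive expires. With it, no workers remain.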

Here's the part where a new executor is created for every BufferedMutator (it's similar for 1.4.13):

https://github.com/apache/hbase/blob/branch-2.4/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java#L420

The reason for the leak in 2.4.4 is the should-we/shouldn't-we cleanup logic added here:

https://github.com/apache/hbase/blob/branch-2.4/hbase-client/src/main/java/org/apache/hadoop/hbase/client/BufferedMutatorImpl.java#L104

That might be OK if pool were initialized there, but in the conn.getBufferedMutator(tableName) code path it isn't: pool is initialized in conn.getBufferedMutator itself, so the executor cleanup code never runs.
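The ownership problem described above can be sketched in isolation. The classes below are hypothetical stand-ins, not the real BufferedMutatorImpl or ConnectionImplementation. The "owning" variant is only one possible way to resolve it, under the assumption that the mutator should own any pool the caller did not supply, and it is not necessarily what the linked pull request does.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolOwnershipSketch {
  /** Hypothetical mutator that only shuts down a pool it created itself. */
  static final class SketchMutator implements AutoCloseable {
    private final ExecutorService pool;
    private final boolean cleanupPoolOnClose;

    SketchMutator(ExecutorService suppliedPool) {
      if (suppliedPool == null) {
        this.pool = Executors.newCachedThreadPool();
        this.cleanupPoolOnClose = true;  // created here, so owned here
      } else {
        this.pool = suppliedPool;
        this.cleanupPoolOnClose = false; // assumed to belong to the caller
      }
    }

    ExecutorService pool() {
      return pool;
    }

    @Override
    public void close() {
      if (cleanupPoolOnClose) {
        pool.shutdown();
      }
    }
  }

  /** Mirrors the reported path: the factory creates the pool before the mutator sees it. */
  static SketchMutator leakyFactory() {
    return new SketchMutator(Executors.newCachedThreadPool());
  }

  /** One possible fix: leave the pool unset so the mutator creates and owns it. */
  static SketchMutator owningFactory() {
    return new SketchMutator(null);
  }

  public static void main(String[] args) {
    SketchMutator leaky = leakyFactory();
    leaky.close();
    System.out.println("leaky pool shut down:  " + leaky.pool().isShutdown());
    leaky.pool().shutdown(); // tidy up the pool that close() skipped

    SketchMutator owning = owningFactory();
    owning.close();
    System.out.println("owning pool shut down: " + owning.pool().isShutdown());
  }
}
```

In the leaky path the mutator sees a non-null pool, assumes the caller owns it, and skips the shutdown, even though no caller holds a reference to that pool.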

Issue Links

links to: GitHub Pull Request #3506

Sub-Tasks

1. conn.getBufferedMutator(tableName) leaks thread executors and other problems (for master branch) - Resolved - Rushabh Shah
