Description
TL;DR: conn.getBufferedMutator(tableName) is dangerous in hbase client 2.4.4, and even in 1.4.13 it doesn't match its documented behavior.
To work around the problems until they're fixed, do this:
var mySingletonPool = HTable.getDefaultExecutor(hbaseConf);
var params = new BufferedMutatorParams(tableName);
params.pool(mySingletonPool);
var myMutator = conn.getBufferedMutator(params);
And avoid code like this:
var myMutator = conn.getBufferedMutator(tableName);
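For completeness, here is roughly what the workaround looks like in context. This is only a sketch: the class and method names (SharedPoolWriter, writeStream, shutdown) and the surrounding structure are mine, not anything prescribed by the hbase API. The points that matter are that the pool is created once, handed to every BufferedMutatorParams, and shut down exactly once when the application is done.

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

class SharedPoolWriter {
    // One pool for the whole application, built the same way the client would
    // otherwise build one per mutator.
    private final ExecutorService sharedPool;
    private final Connection conn;

    SharedPoolWriter(Configuration hbaseConf, Connection conn) {
        this.sharedPool = HTable.getDefaultExecutor(hbaseConf);
        this.conn = conn;
    }

    void writeStream(TableName tableName, Iterable<Put> puts) throws IOException {
        var params = new BufferedMutatorParams(tableName).pool(sharedPool);
        // try-with-resources closes the mutator; since we supplied the pool,
        // closing the mutator leaves the pool alone.
        try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
            for (Put p : puts) {
                mutator.mutate(p);
            }
        }
    }

    void shutdown() {
        // Shut the shared pool down once, when the application is done with hbase.
        sharedPool.shutdown();
    }
}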
The full story:
My application started leaking threads after upgrading from hbase client 1.4.13 to 2.4.4. So much so that after less than a minute of runtime more than 30k threads are leaked and all available virtual memory on the box (> 50 GB) is consumed. Other processes on the box start crashing with memory allocation errors. Even running ls at the shell fails with OS resource allocation failures.
A thread dump after just a few seconds of runtime shows thousands of threads like this:
"htable-pool-0" #8841 prio=5 os_prio=0 cpu=0.15ms elapsed=7.49s tid=0x00007efb6d2a1000 nid=0x57d2 waiting on condition [0x00007ef8a6c38000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method) - parking to wait for <0x00000007e7cd6188> (a java.util.concurrent.SynchronousQueue$TransferStack) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.6/SynchronousQueue.java:462) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.6/SynchronousQueue.java:361) at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.6/SynchronousQueue.java:937) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.6/ThreadPoolExecutor.java:1053) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1114) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
Note: All the threads are labeled htable-pool-0. That suggests we're leaking thread executors, not just threads. The htable-pool part indicates the problem has to do with HTable.getDefaultExecutor(conf), and the only part of my code that interacts with that is a call to conn.getBufferedMutator(tableName).
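To make that reasoning concrete: if a single executor were spawning many threads, the numeric suffix would count up (-1, -2, ...); thousands of threads all ending in -0 means each executor created exactly one thread, so it's the executors that are multiplying. Here is a minimal, hbase-free sketch of the effect; the thread-factory naming code is my own illustration, not the client's actual factory:

import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LeakedExecutorNaming {
    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            // Each "leaked" executor gets its own factory, so its thread counter
            // restarts at 0 and every thread ends up named htable-pool-0.
            AtomicInteger counter = new AtomicInteger(0);
            var pool = Executors.newCachedThreadPool(
                r -> new Thread(r, "htable-pool-" + counter.getAndIncrement()));
            pool.execute(() -> { });  // one short task, then the worker idles
            // No pool.shutdown(): the idle worker lingers for the cached pool's
            // 60-second keep-alive, much like the threads in the dump above.
        }
        Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t.getName().startsWith("htable-pool"))
            .forEach(t -> System.out.println(t.getName()));  // prints htable-pool-0 three times
    }
}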
Looking at the hbase client code shows a few problems:
1) Neither 1.4.13's nor 2.4.4's behavior matches the documentation for conn.getBufferedMutator(tableName), which says:
This BufferedMutator will use the Connection's ExecutorService.
That wording suggests a single, connection-wide executor is being reused, which is not the case.
2) Under 1.4.13 you get a new ThreadPoolExecutor for every BufferedMutator. That's probably not what you want, but you likely won't notice. I didn't; it's a code path I hadn't profiled much.
3) Under 2.4.4 you also get a new ThreadPoolExecutor for every BufferedMutator, but that ThreadPoolExecutor is never shut down after the mutator is closed. Each abandoned ThreadPoolExecutor keeps one idle thread alive until its keep-alive timeout expires, which defaults to 60 seconds.
My application creates one BufferedMutator for every incoming stream. There are lots of streams and some of them are short lived, so my code leaks threads fast under 2.4.4.
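For a sense of scale (my own back-of-the-envelope, not a measurement): if each leaked thread survives the default 60-second keep-alive, then sustaining a population of roughly 30,000 idle threads only requires on the order of 30,000 / 60 ≈ 500 BufferedMutator creations per second.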
Here's the part where a new executor is created for every BufferedMutator (it's similar for 1.4.13):
The reason for the leak in 2.4.4 is the should-we/shouldn't-we cleanup logic added here:
That might be OK if pool were being initialized there, but in the conn.getBufferedMutator(tableName) code path it isn't: pool is initialized inside conn.getBufferedMutator itself, so the executor cleanup code never runs.
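To spell out the interaction, here is a paraphrase of the shape of the logic as I read it, not a verbatim quote of the hbase source; the helper createDefaultPool and the field name cleanupPoolOnClose are illustrative:

// Inside the mutator: clean up the pool on close() only if the mutator created it itself.
if (params.getPool() == null) {
    this.pool = createDefaultPool(conf);   // hypothetical helper standing in for the real code
    this.cleanupPoolOnClose = true;        // close() will shut the pool down
} else {
    this.pool = params.getPool();
    this.cleanupPoolOnClose = false;       // caller owns the pool, leave it alone
}

// But the TableName overload on the connection effectively does:
var pool = HTable.getDefaultExecutor(conf);                              // pool created here...
return getBufferedMutator(new BufferedMutatorParams(tableName).pool(pool));
// ...so params.getPool() is never null, cleanupPoolOnClose stays false,
// and nothing ever calls pool.shutdown(): one abandoned executor per mutator.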