In AbstractFSWAL (FSHLog in branch-1), there is a map that caches threads and their SyncFutures.
A colleague of mine found a memory leak caused by this map.
Every thread that writes to the WAL is cached in this map, and nothing removes a thread from the map, even after the thread is dead.
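To illustrate the retention pattern, here is a minimal sketch, not the actual AbstractFSWAL code. The class ThreadKeyedCacheLeak, the placeholder FakeSyncFuture and the method names are all hypothetical; the only thing assumed from the description above is that the cache is a map keyed by Thread and that entries are never removed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-thread future cache; this is NOT HBase code.
class ThreadKeyedCacheLeak {
    // Placeholder for SyncFuture; the payload mimics state retained per entry.
    static final class FakeSyncFuture {
        final byte[] payload = new byte[1024];
    }

    // Keyed by Thread and never cleaned: each key is a strong reference.
    static final Map<Thread, FakeSyncFuture> CACHE = new ConcurrentHashMap<>();

    // Every thread that "writes" registers itself in the shared map.
    static FakeSyncFuture getSyncFuture() {
        return CACHE.computeIfAbsent(Thread.currentThread(), t -> new FakeSyncFuture());
    }

    // Starts n short-lived writer threads, waits for each to finish, and
    // returns how many TERMINATED threads are still held by the map.
    static int runShortLivedWriters(int n) {
        for (int i = 0; i < n; i++) {
            Thread writer = new Thread(ThreadKeyedCacheLeak::getSyncFuture, "writer-" + i);
            writer.start();
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        int terminated = 0;
        for (Thread t : CACHE.keySet()) {
            if (t.getState() == Thread.State.TERMINATED) {
                terminated++;
            }
        }
        return terminated;
    }

    public static void main(String[] args) {
        System.out.println("terminated threads still cached: " + runShortLivedWriters(5));
    }
}
```

Because the map holds a strong reference to each Thread key, a terminated thread and everything reachable from it stay on the heap for as long as the map itself is alive.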
In one of our customers' clusters, we noticed that even though there were no requests, the heap of the RS was almost full and CMS GC was triggered every second.
We dumped the heap and found more than 30 thousand threads in the Terminated state, all of which were cached in the map above. Everything referenced by these threads was leaked. Most of the threads are:
1. PostOpenDeployTasksThread, which writes the Open Region marker to the WAL.
2. hconnection-0x1f838e31-shared--pool, which is used for the index short-circuit write (Phoenix); the WAL is written and synced in these threads.
3. Index writer thread (Phoenix), which is referenced by RegionCoprocessorHost$RegionEnvironment, then by HRegion, and finally by PostOpenDeployTasksThread.
We should turn this map into a thread local, and let the JVM GC the terminated threads for us.
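A rough sketch of the thread-local idea follows. It is only an illustration under stated assumptions, not a patch: ThreadLocalSyncFutureCache, FakeSyncFuture and the helper methods are hypothetical names, and the real change in AbstractFSWAL may look different.

```java
// Hypothetical sketch of a thread-local future cache; this is NOT HBase code.
class ThreadLocalSyncFutureCache {
    // Placeholder for SyncFuture; records the thread that created it.
    static final class FakeSyncFuture {
        final Thread owner = Thread.currentThread();
    }

    // The value is stored inside the owning Thread object, so no shared
    // structure keeps a strong reference to the thread.
    private static final ThreadLocal<FakeSyncFuture> CACHE =
        ThreadLocal.withInitial(FakeSyncFuture::new);

    static FakeSyncFuture getSyncFuture() {
        return CACHE.get();
    }

    // The same thread gets the same cached instance on every call.
    static boolean reusedWithinThread() {
        return getSyncFuture() == getSyncFuture();
    }

    // A different thread gets its own instance, owned by that thread.
    static boolean distinctAcrossThreads() {
        FakeSyncFuture mine = getSyncFuture();
        FakeSyncFuture[] other = new FakeSyncFuture[1];
        Thread t = new Thread(() -> other[0] = getSyncFuture());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return other[0] != null && other[0] != mine && other[0].owner == t;
    }

    public static void main(String[] args) {
        System.out.println("reused within thread: " + reusedWithinThread());
        System.out.println("distinct across threads: " + distinctAcrossThreads());
    }
}
```

With this layout the cached object is reachable only through its owning thread, so once that thread terminates and nothing else references it, the thread and its cached value become eligible for collection without any explicit cleanup.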