Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Version/s: 1.11.3, 1.12.1, 1.13.0
Environment
- Flink version: 1.12.1
- Run mode: YARN session
- Job type: mock source -> regular join
- Checkpoint interval: 3 min
- TaskManager memory: 16 GB
Description
I ran a regular join job on Flink 1.12.1 and found that TaskManager native memory is not released promptly after a restart caused by exceeding the tolerable checkpoint failure threshold.
Information about the problem job:
- The job first restarted because it exceeded the tolerable checkpoint failure threshold.
- After that, TaskManagers were killed by YARN many times.
- In this case the TM heap is set to 7.68 GB, but heap usage on every TM stayed below 4.2 GB.
- Non-heap usage increased after the restart but stayed below 160 MB.
- TaskManager process memory increased by 3-4 GB after the restart (the figure shows one of the TaskManagers).
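The heap and non-heap figures above are what the JVM itself tracks. As a minimal sketch of how such figures can be read from inside a process, the snippet below uses the standard `MemoryMXBean`; the class name `JvmMemorySnapshot` and its helper methods are hypothetical and not part of Flink. Memory allocated by native libraries is not covered by either number, which is why the process-level figure has to come from the operating system or YARN.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class JvmMemorySnapshot {

    /** Converts bytes to mebibytes for readable output. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns a one-line summary of JVM-tracked heap and non-heap usage. */
    public static String snapshot() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        return "heapUsedMiB=" + toMiB(heap.getUsed())
                + " nonHeapUsedMiB=" + toMiB(nonHeap.getUsed());
    }

    public static void main(String[] args) {
        // Native allocations made through JNI are not included in either figure,
        // so compare this output with the process RSS reported by the OS.
        System.out.println(snapshot());
    }
}
```

A large gap between these numbers and the process memory, as reported here, points at memory outside the JVM-managed pools.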
My guess:
The RocksDB wiki mentions: "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."
So, is it possible that RocksDBStateBackend does not call AbstractNativeReference#close() to release the memory used by RocksDB C++ objects?
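To illustrate the ownership pattern the wiki describes, here is a minimal sketch that does not use RocksDB at all. `NativeHandle` is a hypothetical stand-in for a Java object that owns a C++ object, and the counter only simulates native allocations. The point is that a `close()` call, for example via try-with-resources, releases the resource deterministically instead of waiting for the garbage collector.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ExplicitCloseSketch {

    /** Counts simulated native allocations that have not been released yet. */
    public static final AtomicInteger OPEN_HANDLES = new AtomicInteger();

    /** Hypothetical stand-in for a Java object that owns a C++ object. */
    public static final class NativeHandle implements AutoCloseable {
        private boolean closed;

        public NativeHandle() {
            OPEN_HANDLES.incrementAndGet(); // simulates the native allocation
        }

        @Override
        public void close() {
            if (!closed) { // keep close() idempotent
                closed = true;
                OPEN_HANDLES.decrementAndGet(); // simulates freeing native memory
            }
        }
    }

    /** Uses a handle, releases it deterministically, and returns the open count. */
    public static int useAndRelease() {
        try (NativeHandle handle = new NativeHandle()) {
            // work with the handle here
        }
        return OPEN_HANDLES.get();
    }

    public static void main(String[] args) {
        System.out.println("open handles after try-with-resources: " + useAndRelease());
    }
}
```

If a handle like this is dropped without `close()`, its native memory is only reclaimed if and when a finalizer or cleaner runs, which would be consistent with the delayed release suspected above.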
I made a change:
Actively call System.gc() and System.runFinalization() every minute.
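The exact change is not shown in this report. One way such a periodic call could be wired up is sketched below. The class name `PeriodicGcWorkaround`, the thread name, and the use of a `ScheduledExecutorService` are assumptions for illustration, not the code that was actually run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGcWorkaround {

    /** Starts a daemon thread that requests GC and finalization at a fixed rate. */
    @SuppressWarnings({"removal", "deprecation"}) // runFinalization may be flagged on newer JDKs
    public static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // do not keep the JVM alive for this task
                    return thread;
                });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc();              // a request only; the JVM may ignore it
            System.runFinalization(); // runs pending finalizers of unreachable objects
        }, period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // The report used a one-minute period; a short period keeps this demo brief.
        ScheduledExecutorService scheduler = start(1, TimeUnit.SECONDS);
        Thread.sleep(2500);
        scheduler.shutdownNow();
    }
}
```

This is a diagnostic workaround rather than a fix: forcing full collections adds pause time, and it only helps if the native memory is held by objects that are released through finalization.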
Then I ran the test again:
- TaskManager process memory showed no obvious increase.
- The job ran for several days and restarted many times, but no TaskManager was killed by YARN as before.
Summary:
- First, some native memory is not released promptly after a restart in this situation.
- I guess it may be RocksDB C++ objects, but I have not verified this in the RocksDBStateBackend source code.