[FLINK-21986] TaskManager native memory not released in a timely manner after restart - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.11.3, 1.12.1, 1.13.0
Fix Version/s: 1.11.4, 1.13.0, 1.12.3
Component/s: Runtime / State Backends
Labels:
- pull-request-available
Environment:

Flink version: 1.12.1
Run mode: YARN session
Job type: mock source -> regular join

Checkpoint interval: 3m
TaskManager memory: 16G

Description

I ran a regular join job on Flink 1.12.1 and found that TaskManager native memory is not released in a timely manner after a restart caused by exceeding the tolerable checkpoint failure threshold.

Problem job information:

The job first restarted because it exceeded the tolerable checkpoint failure threshold.
After that, TaskManagers were killed by YARN many times.
In this case the TM heap is set to 7.68G, but heap usage on every TM stays under 4.2G.
Non-heap size increases after the restart, but stays under 160M.
TaskManager process memory increases by 3-4G after the restart (the figure shows one of the TaskManagers).

My guess:

The RocksDB wiki mentions: "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."

So, is it possible that RocksDBStateBackend does not call AbstractNativeReference#close() to release the memory used by RocksDB C++ objects?
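To illustrate the ownership rule quoted from the RocksDB wiki, here is a minimal sketch that does not use RocksDB at all. `NativeHandle` and its `ALLOCATED` counter are hypothetical stand-ins for a Java object that owns C++ memory; they are not part of RocksJava or Flink. The sketch only shows why a missing `close()` leaves native memory allocated even though the Java heap looks healthy.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a Java object that owns off-heap (C++) memory. */
final class NativeHandle implements AutoCloseable {
    // Simulated number of native bytes currently allocated (not a real RocksDB metric).
    static final AtomicLong ALLOCATED = new AtomicLong();

    private final long size;
    private boolean closed;

    NativeHandle(long size) {
        this.size = size;
        ALLOCATED.addAndGet(size); // stands in for a native allocation
    }

    @Override
    public void close() {
        if (!closed) { // keep close() idempotent
            closed = true;
            ALLOCATED.addAndGet(-size); // stands in for freeing the C++ object
        }
    }
}

final class NativeOwnershipSketch {
    /** Drops the handle without close(); in this stub nothing ever frees the bytes. */
    static long leakyUse() {
        new NativeHandle(1024);
        return NativeHandle.ALLOCATED.get();
    }

    /** try-with-resources guarantees close() runs, even if the body throws. */
    static long safeUse() {
        try (NativeHandle handle = new NativeHandle(1024)) {
            // work with the handle here
        }
        return NativeHandle.ALLOCATED.get();
    }

    public static void main(String[] args) {
        System.out.println("allocated after safe use:  " + safeUse());
        System.out.println("allocated after leaky use: " + leakyUse());
    }
}
```

With real native-backed objects, an unclosed handle may still be freed eventually if the library registers a finalizer or cleaner, but only once the garbage collector gets around to the small Java wrapper, which can take a long time when the heap is far from full.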

I made a change:

Actively call System.gc() and System.runFinalization() every minute.
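The report does not show how the periodic call was wired in, so the following is only a sketch of one way to do it with a daemon scheduler thread. The class name `PeriodicGcSketch`, the thread name, and the `runs` counter are assumptions made for illustration. This is a diagnostic workaround, not the fix that was eventually merged for this issue.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PeriodicGcSketch {
    /** Schedules a GC plus finalization pass at a fixed rate; 'runs' counts completed passes. */
    @SuppressWarnings({"deprecation", "removal"}) // runFinalization() is deprecated on newer JDKs
    static ScheduledExecutorService start(long period, TimeUnit unit, AtomicInteger runs) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-gc"); // hypothetical thread name
                t.setDaemon(true); // do not keep the JVM alive just for this task
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc();              // a request to collect; the JVM may ignore it
            System.runFinalization(); // run pending finalizers of unreachable objects
            runs.incrementAndGet();
        }, period, period, unit);
        return scheduler;
    }

    /** Runs the scheduler for a fixed duration and returns how many passes completed. */
    static int countRuns(long periodMillis, long durationMillis) {
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService scheduler = start(periodMillis, TimeUnit.MILLISECONDS, runs);
        try {
            Thread.sleep(durationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // Short period for demonstration only; the report used a one-minute interval,
        // i.e. start(1, TimeUnit.MINUTES, new AtomicInteger()).
        System.out.println("GC passes: " + countRuns(100, 500));
    }
}
```

Forcing collection this way only helps if the leaked native memory is owned by Java objects that are already unreachable; if it does help, that points to missing explicit `close()` calls rather than to a leak inside the native library.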

Then I ran the test again:

TaskManager process memory showed no obvious increase.
The job ran for several days and restarted many times, but no TaskManager was killed by YARN as before.

Summary:

First, some native memory cannot be released in a timely manner after a restart in this situation.
I guess it may be RocksDB C++ objects, but I have not checked this against the source code of RocksDBStateBackend.

Attachments

- 82544.svg, 31/Mar/21 08:45, 92 kB, Feifan Wang
- image-2021-03-25-15-53-44-214.png, 25/Mar/21 07:53, 708 kB, Feifan Wang
- image-2021-03-25-16-07-29-083.png, 25/Mar/21 08:07, 104 kB, Feifan Wang
- image-2021-03-26-11-46-06-828.png, 26/Mar/21 03:46, 107 kB, Feifan Wang
- image-2021-03-26-11-47-21-388.png, 26/Mar/21 03:47, 107 kB, Feifan Wang

Issue Links

links to

GitHub Pull Request #15619

Activity

People

Assignee: Feifan Wang

Reporter: Feifan Wang

Votes: 0

Watchers: 11

Dates

Created: 26/Mar/21 03:59

Updated: 10/Oct/21 19:10

Resolved: 20/Apr/21 08:33