[FLINK-25027] Allow GC of a finished job's JobMaster before the slot timeout is reached - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.14.0, 1.12.5, 1.13.3
Fix Version/s: 1.20.0
Component/s: Runtime / Coordination
Labels: None

Description

In a session cluster, after a (batch) job has finished, its JobMaster appears to stick around for another couple of minutes before it becomes eligible for garbage collection.

A heap dump suggests that the JobMaster is referenced by a PhysicalSlotRequestBulkCheckerImpl that is enqueued in the underlying Akka executor, which keeps the JM from being GC'd. By default, the action is scheduled for slot.request.timeout, which defaults to 5 min (thanks to trohrmann for helping out here).
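The retention mechanism can be illustrated with a minimal sketch using only JDK classes. It is an analogy, not Flink or Akka code: `FinishedJobMaster` and `SharedSchedulerRetention` are hypothetical names, and a `ScheduledThreadPoolExecutor` stands in for the long-lived scheduler. The point is only that a delayed task holding a reference to an object stays queued, and therefore reachable, until its delay elapses.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a JobMaster; this is not Flink's class. */
final class FinishedJobMaster {
    void checkPendingSlotRequests() {
        // No-op: only the reference held by the queued task matters here.
    }
}

final class SharedSchedulerRetention {

    /**
     * Schedules a timeout check on a scheduler that outlives the job and
     * returns how many tasks are still queued once the job is "finished".
     */
    static int queuedAfterJobFinished() {
        // Long-lived scheduler, standing in for the one shared by the cluster.
        ScheduledThreadPoolExecutor shared = new ScheduledThreadPoolExecutor(1);
        try {
            FinishedJobMaster jobMaster = new FinishedJobMaster();
            // The bound method reference captures jobMaster, so the queued
            // task keeps it reachable for the full 5 minute delay.
            shared.schedule(jobMaster::checkPendingSlotRequests, 5, TimeUnit.MINUTES);
            // The job finishes here, but nothing removes the pending task.
            return shared.getQueue().size();
        } finally {
            // Clean up so the sketch itself terminates promptly.
            shared.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("tasks still queued: " + queuedAfterJobFinished());
    }
}
```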

With this setting, you have to provision enough metaspace to cover 5 minutes of retained JobMasters, which may span a couple of jobs, needlessly!

The problem seems to be that Flink schedules this check through the main thread executor, which uses the ActorSystem's scheduler, and a task scheduled with Akka can (probably) not be easily cancelled.
One idea could be to use a dedicated thread pool per JM that we shut down when the JM terminates. That way, pending tasks would no longer keep the JM from being GC'd.

(The concrete example we investigated was a DataSet job)
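The dedicated-pool idea above could look roughly like the following sketch. It uses only JDK classes and is not the change that was made in Flink: `PooledJobMaster`, `scheduleSlotTimeoutCheck`, and `terminate` are hypothetical names chosen for illustration. The assumption is that each JM owns its scheduler and calls `shutdownNow()` on termination, which drains tasks that are still waiting for their delay.

```java
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical JobMaster stand-in that owns its scheduler; not Flink's class. */
final class PooledJobMaster implements AutoCloseable {

    // One scheduler per JM instance, so its lifetime is tied to the JM.
    private final ScheduledThreadPoolExecutor timeoutPool =
            new ScheduledThreadPoolExecutor(1);

    /** Enqueues a delayed check, similar in spirit to the slot request timeout. */
    void scheduleSlotTimeoutCheck(long delay, TimeUnit unit) {
        timeoutPool.schedule(this::checkPendingSlotRequests, delay, unit);
    }

    private void checkPendingSlotRequests() {
        // No-op in this sketch.
    }

    /** Number of delayed checks still waiting in this JM's own queue. */
    int pendingChecks() {
        return timeoutPool.getQueue().size();
    }

    /** Shuts the pool down and returns how many waiting tasks were dropped. */
    int terminate() {
        // shutdownNow() drains the queue and returns the tasks that never ran.
        List<Runnable> dropped = timeoutPool.shutdownNow();
        return dropped.size();
    }

    @Override
    public void close() {
        terminate();
    }

    public static void main(String[] args) {
        PooledJobMaster jm = new PooledJobMaster();
        jm.scheduleSlotTimeoutCheck(5, TimeUnit.MINUTES);
        System.out.println("pending before terminate: " + jm.pendingChecks());
        System.out.println("dropped on terminate: " + jm.terminate());
        System.out.println("pending after terminate: " + jm.pendingChecks());
    }
}
```

Because the queue belongs to the JM itself, no long-lived component holds the pending check once `terminate()` has returned. The trade-off is one extra thread per running job.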

Attachments

image-2021-11-23-20-32-20-479.png, 23/Nov/21 19:32, 325 kB, Nico Kruber

Issue Links

is a child of: FLINK-25318 Improvement of scheduler and execution for Flink OLAP (Open)

Sub-Tasks

1. Add a scheduled thread pool in Endpoint and close it when the endpoint is stopped (Closed, Fang Yong)
2. Move resource timeout checkers in JM from akka executors to the dedicated thread pool (Open, Unassigned)
3. Move heartbeats from akka to the dedicated thread pool in JM (Open, Zhanghao Chen)

People

Assignee: Fang Yong
Reporter: Nico Kruber
Votes: 0
Watchers: 16

Dates

Created: 23/Nov/21 19:41
Updated: 11/Mar/24 12:44