[FLINK-5142] Resource leak in CheckpointCoordinator - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.1, 1.1.2, 1.1.3
Fix Version/s: 1.1.4
Component/s: Runtime / State Backends
Labels: None

Description

We run Flink 1.1.3 with a fairly short checkpoint interval and a minimum pause between checkpoints, so that some work gets done between checkpoints.
Over time, the JobManager uses more and more CPU time until it saturates the available cores. It does not show heavy I/O load and the task managers seem to work without problems.
We see lots of log messages of the form "Trying to trigger another checkpoint while one was queued already" - sometimes multiple in the same millisecond.
It seems that checkpoints are triggered far too often.

I suspect there is a resource leak in the CheckpointCoordinator which leads to this behavior:

// in triggerCheckpoint(long timestamp, long nextCheckpointId), line 414ff
// introduced as part of FLINK-3492
if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
    if (currentPeriodicTrigger != null) {
        currentPeriodicTrigger.cancel();
        currentPeriodicTrigger = null;
    }

    ScheduledTrigger trigger = new ScheduledTrigger();
    timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints, baseInterval);
    return false;
}

The newly created trigger is not assigned to currentPeriodicTrigger, so it cannot be cancelled whenever another rescheduling is required.
If rescheduling is common (it happens several times per minute for us), the running triggers accumulate until they overwhelm the JobManager.
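
To illustrate the accumulation, here is a minimal standalone sketch. It does not use Flink classes: TriggerLeakSketch, CountingTrigger and the rememberNewTrigger flag are hypothetical names, and it assumes the coordinator's timer behaves like a plain java.util.Timer. With the flag off it mirrors the snippet above. With the flag on it shows one possible fix, which is not necessarily what the commit in master does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

public class TriggerLeakSketch {

    /** Hypothetical stand-in for ScheduledTrigger; it only records whether it was cancelled. */
    static final class CountingTrigger extends TimerTask {
        private volatile boolean cancelled;

        @Override
        public void run() {
            // A real trigger would attempt to trigger a checkpoint here.
        }

        @Override
        public boolean cancel() {
            cancelled = true;
            return super.cancel();
        }

        boolean isLive() {
            return !cancelled;
        }
    }

    private final Timer timer = new Timer("sketch-timer", true); // daemon thread
    private final List<CountingTrigger> created = new ArrayList<>();
    private final boolean rememberNewTrigger;
    private CountingTrigger currentPeriodicTrigger;

    public TriggerLeakSketch(boolean rememberNewTrigger) {
        this.rememberNewTrigger = rememberNewTrigger;
    }

    /** Mirrors the rescheduling branch quoted in the description. */
    public void reschedule(long minPauseMillis, long baseIntervalMillis) {
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel();
            currentPeriodicTrigger = null;
        }
        CountingTrigger trigger = new CountingTrigger();
        created.add(trigger);
        if (rememberNewTrigger) {
            // The assignment that is missing in the quoted code.
            currentPeriodicTrigger = trigger;
        }
        timer.scheduleAtFixedRate(trigger, minPauseMillis, baseIntervalMillis);
    }

    /** Number of triggers that were scheduled and never cancelled. */
    public long liveTriggers() {
        return created.stream().filter(CountingTrigger::isLive).count();
    }

    public void shutdown() {
        timer.cancel();
    }

    public static void main(String[] args) {
        TriggerLeakSketch leaky = new TriggerLeakSketch(false);
        TriggerLeakSketch fixed = new TriggerLeakSketch(true);
        for (int i = 0; i < 5; i++) {
            leaky.reschedule(60_000L, 60_000L);
            fixed.reschedule(60_000L, 60_000L);
        }
        System.out.println("leaky live triggers: " + leaky.liveTriggers()); // 5
        System.out.println("fixed live triggers: " + fixed.liveTriggers()); // 1
        leaky.shutdown();
        fixed.shutdown();
    }
}
```

In the leaky variant every rescheduling adds one more periodic task that nothing can cancel, which matches the steadily growing CPU usage we observe.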

Versions up to Flink 1.0.x are unaffected because FLINK-3492 is a Flink 1.1 feature.
The issue seems to be already fixed in master by commit 8854d75c due to (unrelated) work on FLINK-4322.

Let me know if there's anything else I can do to help.

People

Assignee: Stephan Ewen

Reporter: Frank Lauterwald

Votes: 0

Watchers: 2

Dates

Created: 23/Nov/16 09:33

Updated: 28/Nov/16 17:32

Resolved: 28/Nov/16 17:32
