  1. Flink
  2. FLINK-5142

Resource leak in CheckpointCoordinator

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.1.2, 1.1.3
    • Fix Version/s: 1.1.4
    • Labels:
      None

      Description

      We run Flink 1.1.3 with a fairly short checkpoint interval and a minimum pause between checkpoints, to make sure that some work gets done between checkpoints.
      Over time, the JobManager uses more and more CPU time until it saturates the available cores. It does not show heavy I/O load and the task managers seem to work without problems.
      We see lots of log messages of the form "Trying to trigger another checkpoint while one was queued already" - sometimes multiple in the same millisecond.
      It seems like checkpoints are triggered way too often.

      I suspect there is a resource leak in the CheckpointCoordinator which leads to this behavior:

      // in triggerCheckpoint(long timestamp, long nextCheckpointId), line 414ff
      // introduced as part of FLINK-3492
      if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
          if (currentPeriodicTrigger != null) {
              currentPeriodicTrigger.cancel();
              currentPeriodicTrigger = null;
          }

          ScheduledTrigger trigger = new ScheduledTrigger();
          timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints, baseInterval);
          return false;
      }

      The newly created trigger is not assigned to currentPeriodicTrigger, so it cannot be cancelled whenever another rescheduling is required.
      If rescheduling is common (it happens several times per minute for us), the running triggers accumulate until they overwhelm the JobManager.
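
      The accumulation described above can be reproduced outside of Flink with a plain java.util.Timer. The following is only a minimal sketch, not the CheckpointCoordinator code and not necessarily the fix that was committed: the class TriggerLeakSketch, its rememberTrigger flag, and the helper methods liveTriggers() and liveAfter() are hypothetical names introduced for illustration. It counts how many triggers remain uncancelled after several reschedules, with and without assigning the new trigger to currentPeriodicTrigger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical stand-alone sketch; not the actual Flink class.
class TriggerLeakSketch {

    /** Stand-in for the coordinator's ScheduledTrigger; records whether it was cancelled. */
    static final class ScheduledTrigger extends TimerTask {
        volatile boolean cancelled = false;

        @Override
        public void run() {
            // A real trigger would attempt to trigger a checkpoint here.
        }

        @Override
        public boolean cancel() {
            cancelled = true;
            return super.cancel();
        }
    }

    private final Timer timer = new Timer("checkpoint-timer", true);
    private final List<ScheduledTrigger> created = new ArrayList<>();
    private final boolean rememberTrigger; // illustration-only switch
    private ScheduledTrigger currentPeriodicTrigger;

    TriggerLeakSketch(boolean rememberTrigger) {
        this.rememberTrigger = rememberTrigger;
    }

    /** Mirrors the rescheduling branch quoted in the report. */
    void reschedule(long minPauseBetweenCheckpoints, long baseInterval) {
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel();
            currentPeriodicTrigger = null;
        }
        ScheduledTrigger trigger = new ScheduledTrigger();
        created.add(trigger);
        timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints, baseInterval);
        if (rememberTrigger) {
            // The assignment that is missing in the quoted code.
            currentPeriodicTrigger = trigger;
        }
    }

    /** Number of triggers that were never cancelled and would keep firing. */
    long liveTriggers() {
        return created.stream().filter(t -> !t.cancelled).count();
    }

    /** Reschedules the given number of times and reports the surviving triggers. */
    static long liveAfter(boolean rememberTrigger, int reschedules) {
        TriggerLeakSketch sketch = new TriggerLeakSketch(rememberTrigger);
        try {
            for (int i = 0; i < reschedules; i++) {
                // Long delays so that no trigger fires while counting.
                sketch.reschedule(60_000L, 60_000L);
            }
            return sketch.liveTriggers();
        } finally {
            sketch.timer.cancel(); // stop the timer thread of the sketch
        }
    }

    public static void main(String[] args) {
        System.out.println("without assignment: " + liveAfter(false, 5));
        System.out.println("with assignment:    " + liveAfter(true, 5));
    }
}
```

      Without the assignment, every reschedule leaves one more periodic task on the timer. With it, the previous task is cancelled first, so at most one remains scheduled.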

      Versions up to Flink 1.0.x are unaffected because FLINK-3492 is a Flink 1.1 feature.
      The issue seems to be already fixed in master by commit 8854d75c due to (unrelated) work on FLINK-4322.

      Let me know if there's anything else I can do to help.

        Activity

        StephanEwen Stephan Ewen added a comment -

        Thanks for reporting and diagnosing that.
        I'll have a look at it...

        StephanEwen Stephan Ewen added a comment -

        Fixed via e2c53cf85c1af73c040d96dbd24b9e2cf3e8cdf6


          People

          • Assignee: Stephan Ewen (StephanEwen)
          • Reporter: Frank Lauterwald (frank.lauterwald)
          • Votes: 0
          • Watchers: 2
