Flink / FLINK-5063

State handles are not properly cleaned up for declined or expired checkpoints

    Details

      Description

      If a checkpoint is declined or expires, the CheckpointCoordinator disposes of the PendingCheckpoint. Disposing of the PendingCheckpoint discards all SubtaskStates registered so far by the acknowledged Tasks. However, acknowledge messages that arrive late are simply ignored, and the state handles they carry are never discarded. This can clutter the checkpoint directory, because the checkpoint files belonging to late or unknown acknowledge messages are never deleted.

      I propose that the CheckpointCoordinator properly discard the state handles when it receives a late acknowledge message, or an acknowledge message for an unknown ExecutionAttemptID, that belongs to the job of the CheckpointCoordinator. Checkpoint messages belonging to a different job will not be handled and are simply ignored.
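
      The proposed handling could be sketched roughly as follows. This is a simplified, self-contained model and not Flink's actual code: the classes LateAckSketch, StateHandle, PendingCheckpoint and Coordinator, the AckResult enum, and the use of plain strings for job and execution attempt IDs are all assumptions made for illustration. It only shows the three intended outcomes: record the state, discard it, or ignore the message.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the proposal; not Flink's real classes. */
class LateAckSketch {

    /** Stand-in for a state handle that owns checkpoint files. */
    static final class StateHandle {
        private boolean discarded;
        void discardState() { discarded = true; } // would delete the files
        boolean isDiscarded() { return discarded; }
    }

    enum AckResult { RECORDED, DISCARDED, IGNORED }

    /** Stand-in for a pending checkpoint that collects acknowledgements. */
    static final class PendingCheckpoint {
        private final Set<String> notYetAcknowledged;
        private final Map<String, StateHandle> acknowledged = new HashMap<>();

        PendingCheckpoint(Set<String> expectedAttempts) {
            this.notYetAcknowledged = new HashSet<>(expectedAttempts);
        }

        /** Returns false for attempts that are unknown or already acknowledged. */
        boolean acknowledge(String attemptId, StateHandle state) {
            if (!notYetAcknowledged.remove(attemptId)) {
                return false;
            }
            acknowledged.put(attemptId, state);
            return true;
        }

        /** Discards every state handle registered so far. */
        void dispose() {
            for (StateHandle state : acknowledged.values()) {
                state.discardState();
            }
            acknowledged.clear();
        }
    }

    static final class Coordinator {
        private final String jobId;
        private final Map<Long, PendingCheckpoint> pending = new HashMap<>();

        Coordinator(String jobId) { this.jobId = jobId; }

        void trigger(long checkpointId, Set<String> expectedAttempts) {
            pending.put(checkpointId, new PendingCheckpoint(expectedAttempts));
        }

        void declineOrExpire(long checkpointId) {
            PendingCheckpoint checkpoint = pending.remove(checkpointId);
            if (checkpoint != null) {
                checkpoint.dispose();
            }
        }

        AckResult receiveAcknowledge(
                String ackJobId, long checkpointId, String attemptId, StateHandle state) {
            if (!jobId.equals(ackJobId)) {
                return AckResult.IGNORED; // different job: do not touch its state
            }
            PendingCheckpoint checkpoint = pending.get(checkpointId);
            if (checkpoint != null && checkpoint.acknowledge(attemptId, state)) {
                return AckResult.RECORDED;
            }
            // Late ack (checkpoint already disposed) or unknown attempt: free the
            // state. This sketch does not treat duplicate acks separately.
            state.discardState();
            return AckResult.DISCARDED;
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator("job-A");
        coordinator.trigger(1L, new HashSet<>(Arrays.asList("attempt-1", "attempt-2")));
        System.out.println(
            coordinator.receiveAcknowledge("job-A", 1L, "attempt-1", new StateHandle()));
        coordinator.declineOrExpire(1L);
        System.out.println(
            coordinator.receiveAcknowledge("job-A", 1L, "attempt-2", new StateHandle()));
        System.out.println(
            coordinator.receiveAcknowledge("job-B", 1L, "attempt-2", new StateHandle()));
    }
}
```

      In this model the coordinator only discards state it can attribute to its own job, which mirrors the restriction above that messages from a different job are ignored rather than cleaned up.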

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2812

          FLINK-5063 Discard state handles of declined or expired state handles

          Whenever the checkpoint coordinator receives an acknowledge checkpoint message which belongs
          to the job maintained by the checkpoint coordinator, it should either record the state handles
          for later processing or discard them to free the resources. The latter case can happen if a
          checkpoint has expired and late acknowledge checkpoint messages arrive. Furthermore, it
          can happen if a Task sent a decline checkpoint message while other Tasks were still drawing
          a checkpoint. This PR changes the behaviour such that state handles belonging to the job of
          the checkpoint coordinator are discarded if they could not be added to the PendingCheckpoint.

          Review @uce, @StephanEwen

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixStateHandleCleanup

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2812.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2812


          commit c4c000d1b39de5617b6796eed524ce2a449100d3
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-14T17:33:55Z

          FLINK-5063 Discard state handles of declined or expired state handles

          Whenever the checkpoint coordinator receives an acknowledge checkpoint message which belongs
          to the job maintained by the checkpoint coordinator, it should either record the state handles
          for later processing or discard them to free the resources. The latter case can happen if a
          checkpoint has expired and late acknowledge checkpoint messages arrive. Furthermore, it
          can happen if a Task sent a decline checkpoint message while other Tasks were still drawing
          a checkpoint. This PR changes the behaviour such that state handles belonging to the job of
          the checkpoint coordinator are discarded if they could not be added to the PendingCheckpoint.


          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2813

          [backport] FLINK-5063 Discard state handles of declined or expired state handles

          This is a backport of #2812 for the release-1.1 branch.

          Whenever the checkpoint coordinator receives an acknowledge checkpoint message which belongs
          to the job maintained by the checkpoint coordinator, it should either record the state handles
          for later processing or discard them to free the resources. The latter case can happen if a
          checkpoint has expired and late acknowledge checkpoint messages arrive. Furthermore, it
          can happen if a Task sent a decline checkpoint message while other Tasks were still drawing
          a checkpoint. This PR changes the behaviour such that state handles belonging to the job of
          the checkpoint coordinator are discarded if they could not be added to the PendingCheckpoint.

          Review @uce

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportFixStateHandleCleanup

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2813.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2813



          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2812

          Good catch and patch!

          +1 from my side to merge this

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2813

          Similar to #2812 - good fix!
          +1 from my side!

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2813

          Merging this...

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2812

          StephanEwen Stephan Ewen added a comment -

          Fixed in

          • 1.1.4 via 4daf3bbc1e0251e1e84d799421dae9e3fa2363fc
          • 1.2.0 via 72b295b3b52dff2d0bc5b78881826e8936c370ff
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2813


            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann