Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5073

ZooKeeperCompleteCheckpointStore executes blocking delete operation in ZooKeeper client thread

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.1.3
    • Fix Version/s: 1.2.0, 1.1.4
    • Labels:
      None

      Description

      When deleting completed checkpoints from the ZooKeeperCompletedCheckpointStore, one first tries to delete the meta state handle from ZooKeeper and then deletes the actual checkpoint in a callback from the delete operation. This callback is executed by the ZooKeeper client's main thread which is problematic, because it blocks the ZooKeeper client. If a delete operation takes longer than it takes to complete a checkpoint, then it might even happen that delete operations of outdated checkpoints are piling up because they are effectively executed sequentially.

      I propose to execute the delete operations by a dedicated Executor so that we keep the client's main thread free to do ZooKeeper related work.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2815

          FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.

          Review @uce.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixZooKeeperDelete

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2815.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2815


          commit 63ec894de907f38ffff362572c87aef1808062d0
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-15T21:45:04Z

          FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2815 FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead of running it in the ZooKeeper client's thread. The callback can be blocking because it discards state which might entail deleting files from disk. Review @uce. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixZooKeeperDelete Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2815.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2815 commit 63ec894de907f38ffff362572c87aef1808062d0 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-11-15T21:45:04Z FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead of running it in the ZooKeeper client's thread. The callback can be blocking because it discards state which might entail deleting files from disk.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2816

          [backport] FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Backport of #2815 for the release-1.1 branch.

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportFixZooKeeperDelete

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2816.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2816


          commit ae67fe9a3bbc768911b8eab8dc32d18c2cb10c1a
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-15T21:45:04Z

          FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.

          Add TestExecutors


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2816 [backport] FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore Backport of #2815 for the release-1.1 branch. Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead of running it in the ZooKeeper client's thread. The callback can be blocking because it discards state which might entail deleting files from disk. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink backportFixZooKeeperDelete Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2816.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2816 commit ae67fe9a3bbc768911b8eab8dc32d18c2cb10c1a Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-11-15T21:45:04Z FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead of running it in the ZooKeeper client's thread. The callback can be blocking because it discards state which might entail deleting files from disk. Add TestExecutors
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user zentol commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2816#discussion_r88204552

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java —
          @@ -237,11 +242,12 @@ public static CompletedCheckpointStore createCompletedCheckpoints(
          checkpointsPath += ZooKeeperSubmittedJobGraphStore.getPathForJob(jobId);

          return new ZooKeeperCompletedCheckpointStore(

          • maxNumberOfCheckpointsToRetain,
            + maxNumberOfCheckpointsToRetain,
            userClassLoader,
              • End diff –

          inconsistent indent

          Show
          githubbot ASF GitHub Bot added a comment - Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/2816#discussion_r88204552 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java — @@ -237,11 +242,12 @@ public static CompletedCheckpointStore createCompletedCheckpoints( checkpointsPath += ZooKeeperSubmittedJobGraphStore.getPathForJob(jobId); return new ZooKeeperCompletedCheckpointStore( maxNumberOfCheckpointsToRetain, + maxNumberOfCheckpointsToRetain, userClassLoader, End diff – inconsistent indent
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2816#discussion_r88231145

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java —
          @@ -237,11 +242,12 @@ public static CompletedCheckpointStore createCompletedCheckpoints(
          checkpointsPath += ZooKeeperSubmittedJobGraphStore.getPathForJob(jobId);

          return new ZooKeeperCompletedCheckpointStore(

          • maxNumberOfCheckpointsToRetain,
            + maxNumberOfCheckpointsToRetain,
            userClassLoader,
              • End diff –

          True. Will correct it. Thanks for spotting it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2816#discussion_r88231145 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java — @@ -237,11 +242,12 @@ public static CompletedCheckpointStore createCompletedCheckpoints( checkpointsPath += ZooKeeperSubmittedJobGraphStore.getPathForJob(jobId); return new ZooKeeperCompletedCheckpointStore( maxNumberOfCheckpointsToRetain, + maxNumberOfCheckpointsToRetain, userClassLoader, End diff – True. Will correct it. Thanks for spotting it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2816

          Build passed locally: https://travis-ci.org/tillrohrmann/flink/builds/177725509. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2816 Build passed locally: https://travis-ci.org/tillrohrmann/flink/builds/177725509 . Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2816

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/2816
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2815

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2815
          Hide
          till.rohrmann Till Rohrmann added a comment -

          Fixed in 1.2 via 3fb92d8687f03c1fac8b87396b2b5a7ff29f6dd6
          Fixed in 1.1.4 via f2e4c193e1fb6b0cf26861bc01c2f3d6bcd4d8f6

          Show
          till.rohrmann Till Rohrmann added a comment - Fixed in 1.2 via 3fb92d8687f03c1fac8b87396b2b5a7ff29f6dd6 Fixed in 1.1.4 via f2e4c193e1fb6b0cf26861bc01c2f3d6bcd4d8f6

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development