FLINK-5085: Execute CheckpointCoordinator's state discard calls asynchronously

    Details

      Description

      Under certain circumstances, the CheckpointCoordinator discards pending checkpoints or state handles. These discard operations can involve blocking IO if the underlying state handle refers to a file that has to be deleted. To avoid blocking the calling thread, we should execute these calls in a dedicated IO executor.
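
The idea in the description can be sketched roughly as follows. This is only an illustration, not the code from the actual patch: `DiscardableState`, `Coordinator`, `discardAsync` and `runDemo` are hypothetical names, and the thread pool size is an arbitrary assumption. The point is that the caller only schedules the discard on the executor and returns immediately.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncDiscardSketch {

    /** Hypothetical stand-in for a state handle whose discard may block on file IO. */
    interface DiscardableState {
        void discardState() throws Exception;
    }

    /** Hypothetical coordinator fragment: it schedules discards but never runs them itself. */
    static final class Coordinator {
        private final Executor ioExecutor;

        Coordinator(Executor ioExecutor) {
            this.ioExecutor = ioExecutor;
        }

        void discardAsync(DiscardableState state) {
            ioExecutor.execute(() -> {
                try {
                    state.discardState();
                } catch (Exception e) {
                    // A failed discard should not kill the executor thread; report and continue.
                    System.err.println("Could not discard state: " + e);
                }
            });
        }
    }

    /** Schedules n discards on a separate pool and returns how many have completed. */
    static int runDemo(int n) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2); // pool size is an assumption
        AtomicInteger discarded = new AtomicInteger();
        Coordinator coordinator = new Coordinator(ioExecutor);
        for (int i = 0; i < n; i++) {
            coordinator.discardAsync(discarded::incrementAndGet);
        }
        ioExecutor.shutdown();
        try {
            ioExecutor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return discarded.get();
    }

    public static void main(String[] args) {
        System.out.println("discarded = " + runDemo(5));
    }
}
```

Passing a plain `Executor` rather than creating a pool inside the coordinator keeps the choice of thread pool, and its shutdown, with whoever created it.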

          Activity

          xiaogang.shi Xiaogang Shi added a comment -

          Great, this is what I have been thinking about in recent days. Our state is composed of thousands of files on HDFS, and deleting them sequentially takes a long time. A dedicated executor will help improve performance.
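
To illustrate the point about sequential deletion, here is a minimal sketch that deletes many files concurrently on a thread pool. It uses local temporary files via `java.nio.file` instead of HDFS, and `ParallelDeleteSketch`, `deleteAll`, `runDemo` and the pool size of 4 are assumptions made for the example only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelDeleteSketch {

    /** Deletes the given files on the executor and returns how many were actually removed. */
    static long deleteAll(List<Path> files, ExecutorService ioExecutor) {
        List<CompletableFuture<Boolean>> pending = new ArrayList<>();
        for (Path file : files) {
            pending.add(CompletableFuture.supplyAsync(() -> {
                try {
                    return Files.deleteIfExists(file);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }, ioExecutor));
        }
        // join() waits for each deletion; a failed deletion surfaces as a CompletionException.
        return pending.stream().filter(f -> f.join()).count();
    }

    /** Creates fileCount temporary files, deletes them in parallel, and returns the deleted count. */
    static long runDemo(int fileCount) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4); // pool size is an assumption
        try {
            Path dir = Files.createTempDirectory("state-");
            List<Path> files = new ArrayList<>();
            for (int i = 0; i < fileCount; i++) {
                files.add(Files.createTempFile(dir, "chk-", ".bin"));
            }
            long deleted = deleteAll(files, ioExecutor);
            Files.delete(dir); // the directory is empty once all files are gone
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted = " + runDemo(20));
    }
}
```

Whether this helps in practice depends on the file system; the sketch only shows the mechanics of fanning deletions out over several threads.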

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2825

          FLINK-5085 Execute CheckpointCoordinator's state discard calls asynchronously

          This PR is based on #2820 and #2815. Only the commit 77f618a is relevant.

          The `CheckpointCoordinator` is now given an `Executor` which is used to execute the state discard
          calls asynchronously. This prevents blocking operations from being executed in the
          calling thread. The provided `Executor` is the same one that is used for the cleanup in the `ZooKeeperStateHandleStore`.

          The executors are now shut down gracefully after the `JobManager` has terminated. If they don't shut down within the given time (the Akka ask timeout), they are shut down forcibly.
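
The shutdown behaviour described above follows the usual two-phase pattern for `java.util.concurrent.ExecutorService`. The sketch below is an assumption about the general shape, not the code from this PR: `shutdownGracefully` is a hypothetical helper, and the timeout is a plain parameter standing in for the Akka ask timeout.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdownSketch {

    /**
     * Tries an orderly shutdown first and falls back to shutdownNow() on timeout.
     * Returns true if the executor terminated within the timeout.
     */
    static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks, let already submitted ones finish
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            // Timed out: interrupt running tasks and drop the ones that never started.
            List<Runnable> neverStarted = executor.shutdownNow();
            System.err.println("Forced shutdown, " + neverStarted.size() + " task(s) never started");
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> System.out.println("discarding state"));
        System.out.println("graceful = " + shutdownGracefully(executor, 5, TimeUnit.SECONDS));
    }
}
```

Note that `shutdownNow()` only interrupts running tasks; a task that ignores interruption can still keep its thread alive.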

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink makeCheckpointCoordinatorNotBlocking

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2825.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2825


          commit 50838531f305fb92b927ca51aaf4a635e0a07499
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-15T21:45:04Z

          FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.

          commit 00d0722da276251a836b4417a249123c5d7b3947
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-16T17:33:54Z

          FLINK-5082 Pull ExecutorService lifecycle management out of the JobManager

          The provided ExecutorService will no longer be closed by the JobManager. Instead the
          lifecycle is managed outside of it where it was created. This will give a nicer behaviour,
          because it better separates responsibilities.

          commit 6384b9b2cc3a327fc9638bfa2ac6a6a652a14f3c
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-16T17:51:05Z

          Introduce dedicated Executor for blocking io operations

          commit 77f618a57bcb45ec710cab6081a070fb02658482
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-17T14:39:11Z

          FLINK-5085 Execute CheckpointCoordinator's state discard calls asynchronously

          The CheckpointCoordinator is now given an Executor which is used to execute the state discard
          calls asynchronously. This prevents blocking operations from being executed in the
          calling thread.

          Shut down ExecutorServices gracefully


          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2826

          FLINK-5085 Execute CheckpointCoordinator's state discard calls asynchronously

          This PR is a backport of #2825 for the release 1.1 branch. It is based on #2816, so only a70097d is relevant.

          The `CheckpointCoordinator` is now given an `Executor` which is used to execute the state discard
          calls asynchronously. This prevents blocking operations from being executed in the
          calling thread. The provided `Executor` is the same one that is used for the cleanup in the `ZooKeeperStateHandleStore`.

          The executors are now shut down gracefully after the `JobManager` has terminated. If they don't shut down within the given time (the Akka ask timeout), they are shut down forcibly.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportMakeCheckpointCoordinatorNotBlocking

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2826.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2826


          commit 357690b359a2890ec1842a20d345675b79d61cd1
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-15T21:45:04Z

          FLINK-5073 Use Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore

          Use dedicated Executor to run ZooKeeper callbacks in ZooKeeperStateHandleStore instead
          of running it in the ZooKeeper client's thread. The callback can be blocking because it
          discards state which might entail deleting files from disk.

          Add TestExecutors

          commit 640bfef9a176d57fa70d8ac21b8675897fae11ec
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-16T17:33:54Z

          FLINK-5082 Pull ExecutorService lifecycle management out of the JobManager

          The provided ExecutorService will no longer be closed by the JobManager. Instead the
          lifecycle is managed outside of it where it was created. This will give a nicer behaviour,
          because it better separates responsibilities.

          commit 9de05526e49158a5bde1342afe602f358cae993f
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-16T17:51:05Z

          Introduce dedicated Executor for blocking io operations

          commit a70097d4ac619f9203604f6991d293a7b0f55b54
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-17T14:39:11Z

          FLINK-5085 Execute CheckpointCoordinator's state discard calls asynchronously

          The CheckpointCoordinator is now given an Executor which is used to execute the state discard
          calls asynchronously. This prevents blocking operations from being executed in the
          calling thread.

          Shut down ExecutorServices gracefully


          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2826

          Rebased the PR on the latest release-1.1 branch.

          Review @uce, @StephanEwen if you have time.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2826

          Review @StefanRRichter

          githubbot ASF GitHub Bot added a comment -

          Github user StefanRRichter commented on the issue:

          https://github.com/apache/flink/pull/2826

          +1 LGTM

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2826

          Thanks for the review @StefanRRichter. Once Travis gives green light, I'll merge the PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2826

          Build passed locally: https://travis-ci.org/tillrohrmann/flink/builds/177725509. Merging the PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2826

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2825

          @StefanRRichter reviewed the backport #2826 of this PR, which simply uses a different state discarding method, and gave a +1. Since Travis passes as well, I'll merge the PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2825

          Rebasing the PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2825

          Build passed locally: https://travis-ci.org/tillrohrmann/flink/builds/178011045. Merging this PR.

          till.rohrmann Till Rohrmann added a comment -

          Fixed in 1.2 via c590912c93a4059b40452dfa6cffbdd4d58cac13
          Fixed in 1.1.4 via cf4b221270cff3541bea318f907f9d8207b2fa4d

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2825


            People

            • Assignee: Till Rohrmann
            • Reporter: Till Rohrmann