Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5218

Eagerly close checkpoint streams on cancellation

    Details

      Description

      Some output streams perform blocking operations that cannot be properly interrupted. This causes cancellations to take very long when happening concurrently to large synchronous state snapshot operations.

      Closing the streams concurrently helps to abort these blocking operations.

      This might already be fixed in 1.2 by the CloseableRegistry.

        Issue Links

          Activity

          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed in

          • 1.1.4 via 59f61bf6cb8351cec9369e2de39c6eeffbda10ea
          • 1.2.0 via cc006ff18cc7032de3be3fdd9ef7ad383e88bba0
          Show
          StephanEwen Stephan Ewen added a comment - Fixed in 1.1.4 via 59f61bf6cb8351cec9369e2de39c6eeffbda10ea 1.2.0 via cc006ff18cc7032de3be3fdd9ef7ad383e88bba0
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen closed the pull request at:

          https://github.com/apache/flink/pull/2920

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen closed the pull request at: https://github.com/apache/flink/pull/2920
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2920

          Manually merged in cc006ff18cc7032de3be3fdd9ef7ad383e88bba0

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2920 Manually merged in cc006ff18cc7032de3be3fdd9ef7ad383e88bba0
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2920

          Thanks for the review, merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2920 Thanks for the review, merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StefanRRichter commented on the issue:

          https://github.com/apache/flink/pull/2920

          I think the changed semantics makes actually more sense. It should also be fine for all callers, as returning null to them was also previously possible and IRC there should be no special meaning to an empty handle in 1.1.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StefanRRichter commented on the issue: https://github.com/apache/flink/pull/2920 I think the changed semantics makes actually more sense. It should also be fine for all callers, as returning null to them was also previously possible and IRC there should be no special meaning to an empty handle in 1.1.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2920

          @StefanRRichter There is one change of semantics that would be good to get your input on: A checkpoint stream to which a `byte[0]` array was written is now actually empty and returns a `null` state handle in the same way as if nothing was ever written. Before this change, it would have created a state of zero bytes.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2920 @StefanRRichter There is one change of semantics that would be good to get your input on: A checkpoint stream to which a `byte [0] ` array was written is now actually empty and returns a `null` state handle in the same way as if nothing was ever written. Before this change, it would have created a state of zero bytes.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user StephanEwen opened a pull request:

          https://github.com/apache/flink/pull/2920

          FLINK-5218 [state backends] Eagerly close checkpoint streams on cancellation

          When a task is canceled during a checkpoint operation, the operation needs to cancel fast.

          This is a forward fis from version 1.1, where checkpoints could get stuck when the state output streams did not handle interruptions correctly (HDFS has that problem).

          Most of this is already handled in version 1.2 via the CloseableRegistry.

          This adds a test to validate this case is handled correctly and adds minor changes to make it work reliably, like:

          • fail fast on `write()` on closed checkpoint streams
          • fail fast on `flush()` on closed checkpoint streams
          • slight optimization to save a flag in the checkpoint streams

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/StephanEwen/incubator-flink closing_validation

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2920.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2920


          commit e592c098f25f97b223f07ff84cd2fd9233e36dc4
          Author: Stephan Ewen <sewen@apache.org>
          Date: 2016-12-01T16:12:12Z

          FLINK-5218 [state backends] Add test that validates that Checkpoint Streams are eagerly closed on cancellation.

          This is important for some stream implementations (such as HDFS) that do not properly
          handle thread interruption.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/2920 FLINK-5218 [state backends] Eagerly close checkpoint streams on cancellation When a task is canceled during a checkpoint operation, the operation needs to cancel fast. This is a forward fis from version 1.1, where checkpoints could get stuck when the state output streams did not handle interruptions correctly (HDFS has that problem). Most of this is already handled in version 1.2 via the CloseableRegistry . This adds a test to validate this case is handled correctly and adds minor changes to make it work reliably, like: fail fast on `write()` on closed checkpoint streams fail fast on `flush()` on closed checkpoint streams slight optimization to save a flag in the checkpoint streams You can merge this pull request into a Git repository by running: $ git pull https://github.com/StephanEwen/incubator-flink closing_validation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2920.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2920 commit e592c098f25f97b223f07ff84cd2fd9233e36dc4 Author: Stephan Ewen <sewen@apache.org> Date: 2016-12-01T16:12:12Z FLINK-5218 [state backends] Add test that validates that Checkpoint Streams are eagerly closed on cancellation. This is important for some stream implementations (such as HDFS) that do not properly handle thread interruption.
          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed in 1.1.4 via 59f61bf6cb8351cec9369e2de39c6eeffbda10ea

          Show
          StephanEwen Stephan Ewen added a comment - Fixed in 1.1.4 via 59f61bf6cb8351cec9369e2de39c6eeffbda10ea
          Hide
          StephanEwen Stephan Ewen added a comment -

          True, will close one of the issues.

          Show
          StephanEwen Stephan Ewen added a comment - True, will close one of the issues.
          Hide
          aljoscha Aljoscha Krettek added a comment -

          This seems to be a duplicate of FLINK-5215.

          Show
          aljoscha Aljoscha Krettek added a comment - This seems to be a duplicate of FLINK-5215 .

            People

            • Assignee:
              StephanEwen Stephan Ewen
              Reporter:
              StephanEwen Stephan Ewen
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development