  1. Flink
  2. FLINK-4445

Ignore unmatched state when restoring from savepoint

    Details

      Description

      Currently, when a job is submitted with a savepoint, we require that all state in the savepoint is matched to the new job. Many users have noted that this is overly strict. I would like to loosen this and allow savepoints to be restored without matching all state.

      The following options come to mind:

      (1) Keep the current behaviour, but add a flag to allow ignoring state when restoring, e.g. bin/flink run -s <savepoint> --ignoreUnmatchedState. This would not break the API.

      (2) Ignore unmatched state and continue. Additionally, add a flag to be strict about checking the state, e.g. bin/flink run -s <savepoint> --strict. This would be API-breaking, as the default behaviour would change. Users might be confused by this because there is no straightforward way to notice that nothing has been restored.

      I'm not sure which option is best here. Gyula Fora, Aljoscha Krettek, what do you think?

          Activity

          gyfora Gyula Fora added a comment -

          Hi Ufuk,

          My personal experience is that it's very easy to make mistakes when dealing with more complex stateful jobs, such as forgetting to set uids on Kafka sources/sinks and other built-in stateful operators.

          Ignoring the unmatched state by default would be super dangerous and would have caused me serious issues in the past. I think adding a force-ignore flag (option 1) would be a good way to go, and it would also be very useful.

          Cheers,
          Gyula
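
To illustrate the pitfall described above, here is a toy Java model (not Flink code). It assumes, purely for illustration, that an operator without an explicit uid gets an ID derived from its position in the topology. Flink's real automatic IDs are computed differently, but they also depend on the structure of the job. The names `OperatorIdSketch`, `Operator`, and `assignIds` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of explicit vs. automatically assigned operator IDs. */
final class OperatorIdSketch {

    /** An operator with a display name and an optional user-assigned uid (null if not set). */
    static final class Operator {
        final String name;
        final String uid;

        Operator(String name, String uid) {
            this.name = name;
            this.uid = uid;
        }
    }

    /**
     * Returns one ID per operator: the explicit uid if present, otherwise an
     * ID derived from the operator's position. The positional scheme is an
     * assumption made for this sketch only.
     */
    static List<String> assignIds(List<Operator> topology) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < topology.size(); i++) {
            Operator op = topology.get(i);
            ids.add(op.uid != null ? op.uid : "auto-" + i);
        }
        return ids;
    }
}
```

In this model, the topology kafka-source (no uid), window (uid "window"), kafka-sink (no uid) gets the IDs auto-0, window, auto-2. Inserting one operator with uid "parse" after the source yields auto-0, parse, window, auto-3, so state saved under auto-2 no longer matches any operator. If unmatched state were ignored by default, the sink would silently start without its state.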

          uce Ufuk Celebi added a comment -

          Thanks Gyula! I agree with this. Furthermore, users would only need to use the flag once in a while, because after restoring with ignored state, newer savepoints can be restored as usual. +1 for option 1.

          aljoscha Aljoscha Krettek added a comment -

          +1

          StephanEwen Stephan Ewen added a comment -

          +1 for option (1)

          githubbot ASF GitHub Bot added a comment -

          GitHub user uce opened a pull request:

          https://github.com/apache/flink/pull/2712

          FLINK-4445 Add option to ignore unmapped checkpoint state

          When restoring from a checkpoint/savepoint, state for each operator has to be restored. For savepoints, this means that the user cannot remove an operator from her topology and still use the savepoint.

          With this change, state that cannot be mapped back to the job being restored can optionally be ignored. The default behaviour does not change.

          Changes
          • I've removed the `allOrNothingState` flag, as it only affected non-partitioned operator state and was never set to `true` anyway (except in tests). The flag controlled whether each non-partitioned operator state was restored.
          • Moved the savepoint path from the `JobSnapshottingSettings` to the `JobGraph`
          • Added the `--ignoreUnmappedState` (short `-i`) flag to the run command: `bin/flink run -s <savepointPath> -i ...`

          I've tested this manually by triggering a savepoint for a job, adjusting the job (removing an operator), and then trying to resume from the savepoint. By default, restoring fails, but with the flag everything works.
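
As a rough sketch of how the new flag relates to the unchanged default, the following Java snippet parses the two options named above into a small settings object. It is a hypothetical stand-in, not the parser or the classes used in the pull request: `RunOptionsSketch` and `parse` are invented names, and only `-s`, `-i`, and `--ignoreUnmappedState` come from the description.

```java
/** Hypothetical holder for the savepoint-related options of the run command. */
final class RunOptionsSketch {

    /** Savepoint path given via -s, or null if the option is absent. */
    final String savepointPath;

    /** True only if -i or --ignoreUnmappedState was passed; the default stays strict. */
    final boolean ignoreUnmappedState;

    private RunOptionsSketch(String savepointPath, boolean ignoreUnmappedState) {
        this.savepointPath = savepointPath;
        this.ignoreUnmappedState = ignoreUnmappedState;
    }

    /** Scans the arguments for -s <path> and -i / --ignoreUnmappedState. */
    static RunOptionsSketch parse(String[] args) {
        String path = null;
        boolean ignore = false;
        for (int i = 0; i < args.length; i++) {
            if ("-s".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-s requires a savepoint path");
                }
                path = args[++i]; // consume the path that follows -s
            } else if ("-i".equals(args[i]) || "--ignoreUnmappedState".equals(args[i])) {
                ignore = true;
            }
            // All other arguments (jar file, program arguments) are skipped in this sketch.
        }
        return new RunOptionsSketch(path, ignore);
    }
}
```

The flag defaults to `false`, which mirrors the statement that restoring fails by default and only succeeds with unmapped state when the user opts in explicitly.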

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/uce/flink 4445-unmatched_state

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2712.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2712


          commit dc278c51b2bf1f580a6b4cb1670fb70ac871515f
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-10-26T16:00:21Z

          FLINK-4445 [client] Add ignoreUnmappedState flag to CLI

          Allow to specify whether a checkpoint restore should ignore
          checkpoint state that it cannot map to the program. This is
          exposed via the CLI in the run command:

          bin/flink run -s <savepointPath> -i ...

          Furthermore, the savepoint restore settings are moved out of
          the snapshotting settings.

          commit 57621d30dfc4360c786d557a1a00fb57e2ade372
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-10-26T16:05:26Z

          FLINK-4445 [checkpointing] Add option to ignore unmapped checkpoint state

          Allows to ignore checkpoint state that cannot be mapped to a job vertex when
          restoring.


          githubbot ASF GitHub Bot added a comment -

          GitHub user uce opened a pull request:

          https://github.com/apache/flink/pull/2713

          FLINK-4445 Add option to ignore unmapped checkpoint state

          Backport of #2712 for `release-1.1`.

          Technically, this adds new behaviour to a bugfix release, but the default behaviour is not changed and multiple users have already run into this. In such a case, there is no straightforward way to work around the issue.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/uce/flink 4445-unmatched_state-backport_1.1

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2713.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2713


          commit d45e13d2f1e9143458b8c45e2c5201196bf70375
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-10-26T16:00:21Z

          FLINK-4445 [client] Add ignoreUnmappedState flag to CLI

          Allow to specify whether a checkpoint restore should ignore
          checkpoint state that it cannot map to the program. This is
          exposed via the CLI in the run command:

          bin/flink run -s <savepointPath> -i ...

          Furthermore, the savepoint restore settings are moved out of
          the snapshotting settings.

          commit 91f677d8906d7ad92a6919b7756011280a20a5f7
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-10-27T07:49:01Z

          FLINK-4445 [checkpointing] Add option to ignore unmatched savepoint state

          Allows to ignore savepoint state that cannot be mapped to a job vertex when
          restoring.


          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/2712

          Thanks for the review. Going to merge this.

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/2713

          Going to merge this as #2712 was reviewed and this is essentially the same.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2712

          Looks good to me, except the name `ignoreUnmappedState`
          As usual, the three most difficult things in computer science are (1) finding good names, and (2) off-by-one errors.

          What do you think about calling it something like `allowUnresumedState` or `allowNonRestoredState`? The "allow" to me implies that this is a valid scenario.

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/2712

          Very much agree Stephan! I don't know what sounds better to native speakers and more intuitive to users though... unresumed state or non restored state? @greghogan @jgrier do you have any input on this?

          The internal behaviour is the following: The checkpoint/savepoint stores state for each operator of the original job graph (from which the checkpoint/savepoint was triggered), keyed by operator ID. When a user resumes from this checkpoint/savepoint, the checkpoint coordinator tries to map each piece of state (keyed by operator ID) to the operators of the new job. This PR allows some of this state to be left unrestored. Any ideas on what to call this?
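
The mapping described above can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the checkpoint coordinator's actual code: state is represented as plain strings, operator IDs as strings, and `StateMappingSketch` and `mapState` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of mapping savepoint state to the operators of a new job. */
final class StateMappingSketch {

    /**
     * Returns the state that can be restored, keyed by operator ID.
     *
     * @param savepointState state stored in the savepoint, keyed by operator ID
     * @param newJobOperatorIds operator IDs present in the job being restored
     * @param ignoreUnmappedState whether state without a matching operator may be dropped
     * @throws IllegalStateException if some state is unmapped and dropping it is not allowed
     */
    static Map<String, String> mapState(
            Map<String, String> savepointState,
            Set<String> newJobOperatorIds,
            boolean ignoreUnmappedState) {

        Map<String, String> restored = new LinkedHashMap<>();
        List<String> unmapped = new ArrayList<>();

        for (Map.Entry<String, String> entry : savepointState.entrySet()) {
            if (newJobOperatorIds.contains(entry.getKey())) {
                restored.put(entry.getKey(), entry.getValue());
            } else {
                unmapped.add(entry.getKey());
            }
        }

        // Default (strict) behaviour: any unmapped state aborts the restore.
        if (!unmapped.isEmpty() && !ignoreUnmappedState) {
            throw new IllegalStateException(
                    "Cannot map savepoint state of operators " + unmapped + " to the new job");
        }
        return restored;
    }
}
```

For a savepoint holding state for operators "source" and "window" and a new job containing only "source" and "sink", the strict call throws, while the call with the flag set returns only the state of "source". The new operator "sink" simply starts without state in both cases, since only state present in the savepoint is checked.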

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2712

          uce Ufuk Celebi added a comment -

          Fixed in 74c0770 c0e620f (master) and 1f91261 da32af1 (release-1.1).

          githubbot ASF GitHub Bot added a comment -

          Github user uce closed the pull request at:

          https://github.com/apache/flink/pull/2713


            People

            • Assignee: Ufuk Celebi (uce)
            • Reporter: Ufuk Celebi (uce)
            • Votes: 0
            • Watchers: 7
