Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4754

Make number of retained checkpoints user configurable

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.2
    • Fix Version/s: 1.3.0
    • Labels:
      None

      Description

      The number of retained successful checkpoints is fixed to 1. Expose this setting via the CheckpointConfig instead of having it fixed as a static field in the CheckpointRecoveryFactory.

      With the current state of things, this would require to set this value lazily in the checkpoint store implementations instead of setting it when creating the store.

        Issue Links

          Activity

          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed via b46f5e050bdd77fe6e501bad20466d8777218131

          Show
          StephanEwen Stephan Ewen added a comment - Fixed via b46f5e050bdd77fe6e501bad20466d8777218131
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3374

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3374
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tony810430 commented on the issue:

          https://github.com/apache/flink/pull/3374

          @StephanEwen `state.checkpoints.num-retained` is also good to me. Choose what you think is proper to you. Thank you.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tony810430 commented on the issue: https://github.com/apache/flink/pull/3374 @StephanEwen `state.checkpoints.num-retained` is also good to me. Choose what you think is proper to you. Thank you.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tony810430 commented on the issue:

          https://github.com/apache/flink/pull/3374

          @StephanEwen I think `state.checkpoints.max-num-retained` is more clear to me.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tony810430 commented on the issue: https://github.com/apache/flink/pull/3374 @StephanEwen I think `state.checkpoints.max-num-retained` is more clear to me.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3374

          @tony810430 I am adding small followups. Most notably, I renamed the config parameter to `state.checkpoints.num-retained`. In my experience, shorter parameter names are better (less typos, easier to remember for users, etc...)

          Please comment if you have objections.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3374 @tony810430 I am adding small followups. Most notably, I renamed the config parameter to `state.checkpoints.num-retained`. In my experience, shorter parameter names are better (less typos, easier to remember for users, etc...) Please comment if you have objections.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3374

          Looks good, thanks.
          Merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3374 Looks good, thanks. Merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tony810430 commented on the issue:

          https://github.com/apache/flink/pull/3374

          Hi @StephanEwen

          Thank you for the review. I have rebased and done those improvements in your suggestion.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tony810430 commented on the issue: https://github.com/apache/flink/pull/3374 Hi @StephanEwen Thank you for the review. I have rebased and done those improvements in your suggestion.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3374

          Looks good in general. I would suggest two improvements:

          1. Migrate the config parameter to `CoreOptions`
          2. Add a test

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3374 Looks good in general. I would suggest two improvements: 1. Migrate the config parameter to `CoreOptions` 2. Add a test
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tony810430 commented on the issue:

          https://github.com/apache/flink/pull/3374

          Hi @StephanEwen

          Thanks for your comment and I make some change on this PR. I would appreciate it if you have time to review it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tony810430 commented on the issue: https://github.com/apache/flink/pull/3374 Hi @StephanEwen Thanks for your comment and I make some change on this PR. I would appreciate it if you have time to review it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3374

          I think having it only in the configuration is probably fine. I think we do not need both paths here.
          Illegal values are probably checked when creating the recovery factory.

          It would be nice to have a "configuration validator" early, but we do not have something like that currently.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3374 I think having it only in the configuration is probably fine. I think we do not need both paths here. Illegal values are probably checked when creating the recovery factory. It would be nice to have a "configuration validator" early, but we do not have something like that currently.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tony810430 commented on the issue:

          https://github.com/apache/flink/pull/3374

          Hi @StephanEwen thanks for your feedback.

          I totally agree your opinion. I will make this setting be configured in `FlinkConfiguration` and be passed to `CheckpointRecoveryFactory`.

          Besides, I have some questions for the following implementations.

          • Should the verification of this setting throw exception or make it be the default value when the setting is set with illegal value like `-1`.
          • Where should I verify this setting If we need to throw the exception? I couldn't find the validation util for `FlinkConfiguration`, so I don't know where I should place my code.
          • Should I remain the setting to provide flexibility for developers or just make it be the ops' responsibility only?

          Looking forward to having your opinion. Thank you.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tony810430 commented on the issue: https://github.com/apache/flink/pull/3374 Hi @StephanEwen thanks for your feedback. I totally agree your opinion. I will make this setting be configured in `FlinkConfiguration` and be passed to `CheckpointRecoveryFactory`. Besides, I have some questions for the following implementations. Should the verification of this setting throw exception or make it be the default value when the setting is set with illegal value like `-1`. Where should I verify this setting If we need to throw the exception? I couldn't find the validation util for `FlinkConfiguration`, so I don't know where I should place my code. Should I remain the setting to provide flexibility for developers or just make it be the ops ' responsibility only? Looking forward to having your opinion. Thank you.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3374

          Hi @tony810430 thank you for the pull request!
          The code looks good.

          My feeling is, though, that the number of checkpoints to retain is something that we want rather in the configuration of the JobManager, than in the programs snapshot settings.

          Think of it like that: There are often two roles, developer and ops.

          • The developer writes the streaming program, and sets the values for the snapshot settings, like what checkpoint interval would work well for the application etc.
          • The ops person writes the cluster's config and is concerned with running the job reliably.

          Having multiple retained checkpoints is something that concerns more the ops person.

          What do you think?

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3374 Hi @tony810430 thank you for the pull request! The code looks good. My feeling is, though, that the number of checkpoints to retain is something that we want rather in the configuration of the JobManager, than in the programs snapshot settings. Think of it like that: There are often two roles, developer and ops . The developer writes the streaming program, and sets the values for the snapshot settings, like what checkpoint interval would work well for the application etc. The ops person writes the cluster's config and is concerned with running the job reliably. Having multiple retained checkpoints is something that concerns more the ops person. What do you think?
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tony810430 opened a pull request:

          https://github.com/apache/flink/pull/3374

          FLINK-4754 [checkpoints] Make number of retained checkpoints user configurable

          I add `CheckpointConfig.setMaxNumberOfCheckpointsToRetain` to expose user the configuration for number of retained checkpoints, and update the constructor of `JobSnapshottingSettings` to pass the value to `CheckpointRecoveryFactory`.

          However, I didn't make this value lazily in the checkpoint store implementations.
          It is useful to make it lazily if there is a need to reconfigure it during job is running, but I think that should be another issue.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tony810430/flink FLINK-4754

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3374.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3374


          commit d475017c29d9c8b91f740227b439d9fa7da152b3
          Author: Tony Wei <tony19920430@gmail.com>
          Date: 2017-02-20T10:30:24Z

          FLINK-4754 Make number of retained checkpoints user configurable


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tony810430 opened a pull request: https://github.com/apache/flink/pull/3374 FLINK-4754 [checkpoints] Make number of retained checkpoints user configurable I add `CheckpointConfig.setMaxNumberOfCheckpointsToRetain` to expose user the configuration for number of retained checkpoints, and update the constructor of `JobSnapshottingSettings` to pass the value to `CheckpointRecoveryFactory`. However, I didn't make this value lazily in the checkpoint store implementations. It is useful to make it lazily if there is a need to reconfigure it during job is running, but I think that should be another issue. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tony810430/flink FLINK-4754 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3374.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3374 commit d475017c29d9c8b91f740227b439d9fa7da152b3 Author: Tony Wei <tony19920430@gmail.com> Date: 2017-02-20T10:30:24Z FLINK-4754 Make number of retained checkpoints user configurable

            People

            • Assignee:
              tonywei Wei-Che Wei
              Reporter:
              uce Ufuk Celebi
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development