Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-2976

Save and load checkpoints manually

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Implemented
    • 0.10.0
    • 1.0.0
    • Runtime / Coordination
    • None

    Description

      Currently, all checkpointed state is bound to a job. After the job finishes all state is lost. In case of an HA cluster, jobs can live longer than the cluster, but they still suffer from the same issue when they finish.

      Multiple users have requested the feature to manually save a checkpoint in order to resume from it at a later point. This is especially important for production environments. As an example, consider upgrading your existing production Flink program. Currently, you loose all the state of your program. With the proposed mechanism, it will be possible to save a checkpoint, stop and update your program, and then continue your program with the checkpoint.

      The required operations can be simple:

      saveCheckpoint(JobID) => checkpointID: long

      loadCheckpoint(JobID, long) => void

      For the initial version, I would apply the following restriction:

      • The topology needs to stay the same (JobGraph parallelism, etc.)

      A user can configure this behaviour via the environment like the checkpointing interval. Furthermore, the user can trigger the save operation via the command line at arbitrary times and load a checkpoint when submitting a job, e.g.

      bin/flink checkpoint <JobID> => checkpointID: long

      and

      bin/flink run --loadCheckpoint JobID [latest saved checkpoint]
      bin/flink run --loadCheckpoint (JobID,long) [specific saved checkpoint]

      As far as I can tell, the required mechanisms are similar to the ones implemented for JobManager high availability. We need to make sure to persist the CompletedCheckpoint instances as a pointer to the checkpoint state and to not remove saved checkpoint state.

      On the client side, we need to give the job and its vertices the same IDs to allow mapping the checkpoint state.

      Attachments

        Issue Links

          Activity

            People

              uce Ufuk Celebi
              uce Ufuk Celebi
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: