Flink / FLINK-5763

Make savepoints self-contained and relocatable


Details

    • After FLINK-5763, savepoints are self-contained and relocatable, so users can move a savepoint from one location to another without any manual processing. This feature is currently not supported when entropy injection is enabled.

    Description

      After a user has triggered a savepoint, a single savepoint file will be returned as a handle to the savepoint. A savepoint to <target> creates a savepoint file like <target>/savepoint-<randomSuffix>.

      This file contains the metadata of the corresponding checkpoint, but not the actual program state. While this works well for short-term management (pausing and resuming a job), it makes it hard to manage savepoints over longer periods of time.

      Problems

      Scattered Checkpoint Files

      For file-system-based checkpoints (FsStateBackend, RocksDBStateBackend), the savepoint file references files in the checkpoint directory, which is usually different from <target>. For users, it is virtually impossible to tell which checkpoint files belong to a savepoint and which are merely lingering around. This can easily lead to accidentally invalidating a savepoint by deleting checkpoint files.

      Savepoints Not Relocatable

      Even if a user is able to figure out which checkpoint files belong to a savepoint, moving these files will invalidate the savepoint as well, because the metadata file references absolute file paths.

      Forced to Use CLI for Disposal

      Because of the scattered files, the user is in practice forced to use Flink’s CLI to dispose of a savepoint. Disposal should instead be possible within the user’s own environment through a plain file system delete operation.

      Proposal

      To solve the problems described above, a savepoint should contain all of its state, both metadata and program state, inside a single directory. Furthermore, the metadata must hold only relative references to the checkpoint files. This makes it obvious which files make up the state of a savepoint, and a savepoint can be relocated simply by moving its directory.
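The effect of relative references can be illustrated with a minimal sketch. It is not Flink's implementation: the JSON metadata format, the function names and the example file names are all assumptions made for illustration. The point is only that a metadata file listing names relative to its own directory stays valid after the directory is moved.

```python
import json
import shutil
import tempfile
from pathlib import Path

METADATA_FILE = "_metadata"


def write_savepoint(savepoint_dir: Path, state_chunks: dict) -> None:
    """Write data files plus a metadata file that lists them by relative name.

    JSON is used purely for illustration; it is not Flink's metadata format.
    """
    savepoint_dir.mkdir(parents=True)
    for name, payload in state_chunks.items():
        (savepoint_dir / name).write_bytes(payload)
    # Only names relative to the savepoint directory are recorded,
    # never absolute paths.
    metadata = {"state_files": sorted(state_chunks)}
    (savepoint_dir / METADATA_FILE).write_text(json.dumps(metadata))


def read_savepoint(savepoint_dir: Path) -> dict:
    """Resolve every relative reference against the metadata file's directory."""
    metadata = json.loads((savepoint_dir / METADATA_FILE).read_text())
    return {
        name: (savepoint_dir / name).read_bytes()
        for name in metadata["state_files"]
    }


def relocation_roundtrip() -> bool:
    """Write a savepoint, move its directory, and read it from the new place."""
    chunks = {"data-1a2b": b"operator state", "data-3c4d": b"keyed state"}
    with tempfile.TemporaryDirectory() as root:
        original = Path(root) / "old" / "savepoint-ab12-9f3c"
        write_savepoint(original, chunks)

        relocated = Path(root) / "new" / "savepoint-ab12-9f3c"
        relocated.parent.mkdir()
        shutil.move(str(original), str(relocated))

        return read_savepoint(relocated) == chunks


if __name__ == "__main__":
    print(relocation_roundtrip())
```

Had the metadata stored absolute paths under `old/`, the read after the move would fail, which is the relocation problem described above.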

      Desired File Layout

      Triggering a savepoint to <target> creates a directory as follows:

      <target>/savepoint-<jobId>-<randomSuffix>
        +-- _metadata
        +-- data-<randomSuffix> [1 or more]
      

      We include the JobID in the savepoint directory name in order to give some hints about which job a savepoint belongs to.
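As a rough sketch of the naming scheme and layout above, the following helpers build a directory name, recover the job ID hint from it, and check that a directory contains `_metadata` plus at least one `data-*` file. The helper names, the suffix length and the assumption that both the job ID and the suffix are lowercase hex strings are illustrative choices, not part of the proposal.

```python
import re
import secrets
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: job ID and random suffix are lowercase hex strings.
SAVEPOINT_DIR_PATTERN = re.compile(
    r"^savepoint-(?P<job_id>[0-9a-f]+)-(?P<suffix>[0-9a-f]+)$"
)


def savepoint_dir_name(job_id: str) -> str:
    """Build a directory name of the form savepoint-<jobId>-<randomSuffix>."""
    return f"savepoint-{job_id}-{secrets.token_hex(6)}"


def job_id_from_dir(name: str) -> Optional[str]:
    """Return the job ID hint encoded in the name, or None if it does not match."""
    match = SAVEPOINT_DIR_PATTERN.match(name)
    return match.group("job_id") if match else None


def has_expected_layout(savepoint_dir: Path) -> bool:
    """Check for a _metadata file and one or more data-* files."""
    has_metadata = (savepoint_dir / "_metadata").is_file()
    has_data = any(p.is_file() for p in savepoint_dir.glob("data-*"))
    return has_metadata and has_data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as target:
        directory = Path(target) / savepoint_dir_name("ab12")
        directory.mkdir()
        (directory / "_metadata").write_bytes(b"")
        (directory / "data-0f0f").write_bytes(b"state")
        print(job_id_from_dir(directory.name), has_expected_layout(directory))
```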

      CLI
      • Trigger: When triggering a savepoint to <target> the savepoint directory will be returned as the handle to the savepoint.
      • Restore: Users can restore by pointing to either the directory or the _metadata file. The data files must be in the same directory as the _metadata file.
      • Dispose: The disposal command should be deprecated and eventually removed. While deprecated, disposal can happen by specifying the directory or the _metadata file (same as restore).
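The restore and dispose behaviour above, accepting either the directory or the `_metadata` file as the handle, could be sketched as follows. This is a hypothetical helper rather than Flink's CLI code: `resolve_savepoint_dir` and `dispose` are names invented here. Disposal reduces to a single recursive delete because all state lives in one directory.

```python
import shutil
import tempfile
from pathlib import Path

METADATA_FILE = "_metadata"


def resolve_savepoint_dir(handle: Path) -> Path:
    """Map a handle (savepoint directory or its _metadata file) to the directory."""
    if handle.is_dir():
        directory = handle
    elif handle.name == METADATA_FILE and handle.is_file():
        directory = handle.parent
    else:
        raise ValueError(
            f"not a savepoint directory or {METADATA_FILE} file: {handle}"
        )
    # The metadata file must sit in the savepoint directory itself.
    if not (directory / METADATA_FILE).is_file():
        raise FileNotFoundError(f"{METADATA_FILE} missing in {directory}")
    return directory


def dispose(handle: Path) -> None:
    """Delete the whole savepoint directory in one file system operation."""
    shutil.rmtree(resolve_savepoint_dir(handle))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as target:
        savepoint = Path(target) / "savepoint-ab12-9f3c"
        savepoint.mkdir()
        (savepoint / METADATA_FILE).write_bytes(b"")
        (savepoint / "data-0f0f").write_bytes(b"state")

        # Both handle forms resolve to the same directory.
        print(resolve_savepoint_dir(savepoint) == resolve_savepoint_dir(savepoint / METADATA_FILE))
        dispose(savepoint / METADATA_FILE)
        print(savepoint.exists())
```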


          People

            klion26 Congxian Qiu
            uce Ufuk Celebi
            Votes: 4
            Watchers: 23


              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 10m
