Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-27114

On JM restart, the information about the initial checkpoints can be lost

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.14.4, 1.15.0, 1.16.0
    • None
    • None

    Description

      Scenario (1.14):

      1. A job starts from an existing checkpoint 1, with incremental checkpoints enabled
      2. Checkpoint 1 is loaded with discardOnSubsume=false by CheckpointCoordinator.restoreSavepoint
      3. A new checkpoint 2 completes, it reuses some state from the initial checkpoint
      4. At some point, checkpoint 1 is subsumed, but the state is not discarded (thanks to discardOnSubsume=false, ref counts stay 1)
      5. JM crashes
      6. JM restarts, loads the checkpoints 2..x from ZK (or other store) -   discardOnSubsume=true (as deserialized from handles)
      7. At some point, checkpoint 2 is subsumed and the initial shared state is not used anymore; because checkpoint 2 has discardOnSubsume=true, shared state will be erroneously discarded

      In 1.15, there were the following changes:

      1. RestoreMode was added; only LEGACY mode is affected (in NO_CLAIM mode, checkpoint 2 can't reuse any initial state; and in CLAIM mode, it's fine to discard the initial state)
      2. SharedStateRegistry was changed from refCounts to highest checkpoint ID (FLINK-24611)
      3. In step (7), state will not be discarded (FLINK-26985); however, because it's impossible to distinguish initial state from the state created by this job, the latter will not be discarded as well, leading to left-over state artifacts.

      The proposed solution is to store the initial checkpoint ID (in store such as ZK or in checkpoints) and adjust steps 6 or 7.

      Storing restore information in checkpoint allows to handle multiple restore modes in the "lineage", e.g.:

      Initial run -> restore in NO_CLAIM -> restore in CLAIM

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              roman Roman Khachatryan
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: