[FLINK-27114] On JM restart, the information about the initial checkpoints can be lost - ASF JIRA

Scenario (1.14):

1. A job starts from an existing checkpoint 1, with incremental checkpoints enabled.
2. Checkpoint 1 is loaded with discardOnSubsume=false by CheckpointCoordinator.restoreSavepoint.
3. A new checkpoint 2 completes; it reuses some state from the initial checkpoint.
4. At some point, checkpoint 1 is subsumed, but its state is not discarded (thanks to discardOnSubsume=false, ref counts stay 1).
5. JM crashes.
6. JM restarts and loads checkpoints 2..x from ZK (or another store) with discardOnSubsume=true (as deserialized from the handles).
7. At some point, checkpoint 2 is subsumed and the initial shared state is not used anymore; because checkpoint 2 has discardOnSubsume=true, the shared state is erroneously discarded.
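
The scenario above can be reproduced with a small toy model. This is only a sketch of the ref-counting idea, not Flink's actual SharedStateRegistry; ToyRefCountRegistry, ToyCheckpoint, LostInitialCheckpointDemo and the state keys are hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy checkpoint: an ID, the shared state it references, and the discard flag. */
class ToyCheckpoint {
    final long id;
    final Set<String> sharedState;
    final boolean discardOnSubsume;

    ToyCheckpoint(long id, Set<String> sharedState, boolean discardOnSubsume) {
        this.id = id;
        this.sharedState = sharedState;
        this.discardOnSubsume = discardOnSubsume;
    }
}

/** Toy ref-counting registry (hypothetical, NOT Flink's SharedStateRegistry). */
class ToyRefCountRegistry {
    final Map<String, Integer> refCounts = new HashMap<>();
    final Set<String> deleted = new HashSet<>();

    void register(ToyCheckpoint cp) {
        for (String key : cp.sharedState) {
            refCounts.merge(key, 1, Integer::sum);
        }
    }

    void subsume(ToyCheckpoint cp) {
        if (!cp.discardOnSubsume) {
            return; // references are kept, so nothing can reach zero
        }
        for (String key : cp.sharedState) {
            int remaining = refCounts.merge(key, -1, Integer::sum);
            if (remaining == 0) {
                refCounts.remove(key);
                deleted.add(key); // stands in for deleting the file
            }
        }
    }
}

class LostInitialCheckpointDemo {
    /** Replays steps 1-7 and returns the state deleted after the JM restart. */
    static Set<String> run() {
        // Steps 1-4: the initial checkpoint is registered with discardOnSubsume=false.
        ToyRefCountRegistry beforeCrash = new ToyRefCountRegistry();
        ToyCheckpoint cp1 = new ToyCheckpoint(1, Set.of("initial-sst"), false);
        ToyCheckpoint cp2 = new ToyCheckpoint(2, Set.of("initial-sst", "new-sst-2"), true);
        beforeCrash.register(cp1);
        beforeCrash.register(cp2);
        beforeCrash.subsume(cp1); // nothing is deleted here

        // Steps 5-6: the JM restarts; only checkpoints 2..x are recovered,
        // so the registry no longer knows about checkpoint 1.
        ToyRefCountRegistry afterRestart = new ToyRefCountRegistry();
        ToyCheckpoint recoveredCp2 = new ToyCheckpoint(2, cp2.sharedState, true);
        ToyCheckpoint cp3 = new ToyCheckpoint(3, Set.of("new-sst-3"), true);
        afterRestart.register(recoveredCp2);
        afterRestart.register(cp3);

        // Step 7: subsuming checkpoint 2 drops the last reference to the initial state.
        afterRestart.subsume(recoveredCp2);
        return afterRestart.deleted;
    }

    public static void main(String[] args) {
        System.out.println("Deleted after restart: " + run());
    }
}
```

In this model the returned set includes "initial-sst", i.e. state that belongs to the initial checkpoint and should have been left alone.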

In 1.15, there were the following changes:

RestoreMode was added; only LEGACY mode is affected (in NO_CLAIM mode, checkpoint 2 can't reuse any initial state; and in CLAIM mode, it's fine to discard the initial state)
SharedStateRegistry was changed from refCounts to highest checkpoint ID (~~FLINK-24611~~)
In step (7), state will not be discarded (~~FLINK-26985~~); however, because it's impossible to distinguish the initial state from the state created by this job, the latter will not be discarded either, leading to left-over state artifacts.

The proposed solution is to store the initial checkpoint ID (in a store such as ZK, or in the checkpoints themselves) and adjust step 6 or step 7 accordingly.

Storing the restore information in the checkpoint makes it possible to handle multiple restore modes in the "lineage", e.g.:

Initial run -> restore in NO_CLAIM -> restore in CLAIM
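
One possible shape of the proposal, adjusting step 6, is sketched below. It is an illustration under assumptions rather than an actual patch: it assumes the persisted initial checkpoint ID lets the restarted JM look up which shared state belongs to the initial checkpoint, and it marks that state as not owned by the job. ToyProtectingRegistry, InitialCheckpointIdDemo, the metadata map and the state keys are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy registry that never deletes state it was told is externally owned (hypothetical). */
class ToyProtectingRegistry {
    final Map<String, Integer> refCounts = new HashMap<>();
    final Set<String> externallyOwned = new HashSet<>();
    final Set<String> deleted = new HashSet<>();

    /** Assumed adjustment of step 6: protect the state of the initial checkpoint. */
    void protectStateOf(Set<String> initialSharedState) {
        externallyOwned.addAll(initialSharedState);
    }

    void register(Set<String> sharedState) {
        for (String key : sharedState) {
            refCounts.merge(key, 1, Integer::sum);
        }
    }

    void subsume(Set<String> sharedState) {
        for (String key : sharedState) {
            int remaining = refCounts.merge(key, -1, Integer::sum);
            if (remaining == 0) {
                refCounts.remove(key);
                if (!externallyOwned.contains(key)) {
                    deleted.add(key); // only state created by this job is deleted
                }
            }
        }
    }
}

class InitialCheckpointIdDemo {
    /** Replays steps 6-7 with the initial checkpoint ID available after restart. */
    static Set<String> run() {
        // Stand-in for checkpoint metadata that can be looked up by checkpoint ID.
        Map<Long, Set<String>> metadata = new HashMap<>();
        metadata.put(1L, Set.of("initial-sst"));
        metadata.put(2L, Set.of("initial-sst", "new-sst-2"));
        metadata.put(3L, Set.of("new-sst-3"));

        // Assumption: this ID was persisted (e.g. in ZK) and survives the JM crash.
        long persistedInitialCheckpointId = 1L;

        ToyProtectingRegistry registry = new ToyProtectingRegistry();
        registry.protectStateOf(metadata.get(persistedInitialCheckpointId));
        registry.register(metadata.get(2L));
        registry.register(metadata.get(3L));

        // Step 7: checkpoint 2 is subsumed.
        registry.subsume(metadata.get(2L));
        return registry.deleted;
    }

    public static void main(String[] args) {
        System.out.println("Deleted after restart: " + run());
    }
}
```

In this model, the state created by the job ("new-sst-2") is cleaned up while "initial-sst" survives, which is the distinction the current behaviour in both 1.14 and 1.15 cannot make.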

is related to

FLINK-27132 CheckpointResourcesCleanupRunner might discard shared state of the initial checkpoint

FLINK-26985 With legacy restore mode, incremental checkpoints would be deleted by mistake