[FLINK-12144] Option to prefer checkpoints on recovery - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:
None

Description

When a streaming job fails the getLatestCheckpoint() of the CheckpointStore is used to determine which checkpoint or savepoint is going to be used for recovery.

This behaviour is perfectly fine for jobs with relatively small states or where there are no strong SLAs but it some cases it can be problematic.

For jobs with a very large state size, the difference between recovery times from savepoints and checkpoints can be substantial to the point where it might break a use-case. So we would like to avoid ever recovering from a savepoint if a not too old checkpoint is also readily available.

This cannot be avoided right now if a job fails after we took a savepoint maybe for backup purposes (maybe it is scheduled multiple times a day).

I suggest we add a configuration option to allow the job to fall back to an earlier checkpoint (within maybe a certain age limit) even if there is a newer savepoint available to avoid lengthy downtimes.

Attachments

Activity

People

Assignee:: vinoyang

Reporter:: Gyula Fora

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 09/Apr/19 09:43

Updated:: 13/Apr/21 20:40

Resolved:: 09/Apr/19 12:20