[FLINK-11159] Allow configuration whether to fall back to savepoints for restore - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.5, 1.6.2, 1.7.0
Fix Version/s: 1.9.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Release Note:

The signature of the `CompletedCheckpointStore#getLatestCheckpoint` method has been changed from `getLatestCheckpoint()` to `getLatestCheckpoint(boolean)`. This signature change breaks backwards compatibility and requires you to update your `CompletedCheckpointStore` implementation.

If the parameter is `true`, then only checkpoints will be considered for recovery. Otherwise savepoints will be used for recoveries as well.
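
As a rough illustration of the semantics described in this release note, the sketch below models the selection rule with a plain in-memory list. `SimpleCheckpointStore` and `SimpleCheckpoint` are hypothetical stand-ins invented for this example, not Flink classes; only the method name `getLatestCheckpoint(boolean)` and the meaning of its parameter come from the note. The `Optional` return type and the fallback to a savepoint when no checkpoint exists are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical, simplified model of a completed-checkpoint store (not the Flink interface). */
final class SimpleCheckpointStore {

    /** Hypothetical stand-in for a completed checkpoint or savepoint. */
    static final class SimpleCheckpoint {
        final long id;
        final boolean savepoint;

        SimpleCheckpoint(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }
    }

    // Completed snapshots in completion order, oldest first.
    private final List<SimpleCheckpoint> completed = new ArrayList<>();

    void add(long id, boolean savepoint) {
        completed.add(new SimpleCheckpoint(id, savepoint));
    }

    /**
     * If preferCheckpointForRecovery is true, returns the most recent entry that is
     * not a savepoint. Otherwise returns the most recent entry of either kind.
     */
    Optional<SimpleCheckpoint> getLatestCheckpoint(boolean preferCheckpointForRecovery) {
        if (completed.isEmpty()) {
            return Optional.empty();
        }
        if (preferCheckpointForRecovery) {
            // Walk backwards from the newest entry and skip savepoints.
            for (int i = completed.size() - 1; i >= 0; i--) {
                if (!completed.get(i).savepoint) {
                    return Optional.of(completed.get(i));
                }
            }
            // Assumption of this sketch: with only savepoints available,
            // fall through and return the newest savepoint.
        }
        return Optional.of(completed.get(completed.size() - 1));
    }

    public static void main(String[] args) {
        SimpleCheckpointStore store = new SimpleCheckpointStore();
        store.add(1, false); // chk1
        store.add(2, false); // chk2
        store.add(3, true);  // sp
        System.out.println("prefer checkpoints: " + store.getLatestCheckpoint(true).get().id);
        System.out.println("allow savepoints:   " + store.getLatestCheckpoint(false).get().id);
    }
}
```

With the three entries added in `main`, passing `true` selects the entry with id 2 (the newest checkpoint), while passing `false` selects id 3 (the savepoint).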

Description

Ever since FLINK-3397, Flink restarts after a failure from the latest checkpoint or savepoint, whichever is more recent. With the introduction of local recovery, and given that restoring a RocksDB checkpoint just copies the files, it may be time to reconsider this behavior and make it configurable:
In certain situations, it may be faster to restore from the latest checkpoint only (even if there is a more recent savepoint) and reprocess the data in between. The downside is that this may not be correct: if the savepoint was the latest snapshot, restoring from an older checkpoint can break side effects. Consider this chain: chk1 -> chk2 -> sp … restore chk2 -> …. All side effects between chk2 and sp would then be reproduced.
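
The replay concern in the chain above can be made concrete with a small sketch. The record offsets below (chk2 at 4, sp at 6, failure at 8) are hypothetical values chosen for illustration, and `ReplayExample` is not part of Flink. It only lists which record offsets would be processed a second time for each restore choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of which records are reprocessed after a restore. */
final class ReplayExample {

    /** Returns the offsets in [snapshotOffset, failureOffset) that are processed again. */
    static List<Integer> replayed(int snapshotOffset, int failureOffset) {
        List<Integer> offsets = new ArrayList<>();
        for (int offset = snapshotOffset; offset < failureOffset; offset++) {
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical positions in the input at which each snapshot was taken.
        int chk2 = 4;
        int sp = 6;
        int failure = 8;

        // Restoring from the savepoint replays only what came after it.
        System.out.println("restore sp:   " + replayed(sp, failure));
        // Restoring from chk2 additionally replays the records between chk2 and sp.
        System.out.println("restore chk2: " + replayed(chk2, failure));
    }
}
```

Under these assumed offsets, restoring from chk2 reprocesses offsets 4 and 5 in addition to 6 and 7, so any side effect those two records triggered before the savepoint would be emitted a second time.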

Making this configurable lets users choose whichever option gives them the lowest recovery time in Flink.

Issue Links

causes

- FLINK-14145 getLatestCheckpoint(true) returns wrong checkpoint (Resolved)
- FLINK-20427 Remove CheckpointConfig.setPreferCheckpointForRecovery because it can lead to data loss (Closed)
- FLINK-13692 Make CompletedCheckpointStore backwards compatible? (Closed)

relates to

- FLINK-8360 Implement task-local state recovery (Closed)
- FLINK-3397 Failed streaming jobs should fall back to the most recent checkpoint/savepoint (Closed)

links to

- GitHub Pull Request #8410

People

Assignee: vinoyang
Reporter: Nico Kruber
Votes: 0
Watchers: 11

Dates

Created: 13/Dec/18 16:53
Updated: 30/Nov/20 16:31
Resolved: 12/Aug/19 09:19

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 20m