[FLINK-21351] Incremental checkpoint data would be lost once a non-stop savepoint completed - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.11.3, 1.12.1, 1.13.0
Fix Version/s: 1.12.2, 1.13.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

~~FLINK-10354~~ made savepoints count as retained checkpoints so that a job can fail over from the latest position. I think this behavior is reasonable; however, the current implementation causes incremental checkpoint data to be lost as soon as a non-stop savepoint completes.

The current general flow for incremental checkpoints: once a newer checkpoint completes, it is added to the checkpoint store. If the number of completed checkpoints then exceeds the max retained limit, the oldest one is subsumed. Subsuming decrements the reference count of that checkpoint's incremental data by one, and the data is deleted once its reference count reaches zero. Because we always register the newer checkpoint before unregistering the older one, this flow works as expected.
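The register-then-unregister ordering described above can be sketched with a toy model. This is a minimal illustration, not Flink's actual code: the class names `SharedStateRegistry` and `CheckpointStore`, the file names, and the checkpoint names are all hypothetical.

```python
from collections import Counter, deque


class SharedStateRegistry:
    """Toy reference counter for shared incremental files (hypothetical model)."""

    def __init__(self):
        self.refs = Counter()   # file name -> number of retained checkpoints using it
        self.deleted = []       # files whose count dropped to zero, in deletion order

    def register(self, files):
        for f in files:
            self.refs[f] += 1

    def unregister(self, files):
        for f in files:
            self.refs[f] -= 1
            if self.refs[f] == 0:
                del self.refs[f]
                self.deleted.append(f)


class CheckpointStore:
    """Toy completed-checkpoint store with a retention limit (hypothetical model)."""

    def __init__(self, registry, max_retained=1):
        self.registry = registry
        self.max_retained = max_retained
        self.retained = deque()

    def add(self, name, files):
        # Register the newer checkpoint first ...
        self.registry.register(files)
        self.retained.append((name, files))
        # ... and only then subsume (unregister) the oldest ones.
        while len(self.retained) > self.max_retained:
            _, old_files = self.retained.popleft()
            self.registry.unregister(old_files)


registry = SharedStateRegistry()
store = CheckpointStore(registry, max_retained=1)
store.add("chk-1", ["sst-a", "sst-b"])
store.add("chk-2", ["sst-a", "sst-b", "sst-c"])   # reuses sst-a and sst-b
store.add("chk-3", ["sst-b", "sst-c", "sst-d"])   # no longer needs sst-a

print(registry.deleted)        # files removed because nothing references them
print(sorted(registry.refs))   # files still referenced by the retained checkpoint
```

In this model a file is deleted only after the newest retained checkpoint has stopped referencing it, which is why the ordering is safe when only incremental checkpoints are involved.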

However, when a non-stop savepoint (i.e. a savepoint triggered manually while the job keeps running) completes, it is also added to the checkpoint store and simply subsumes the previously added checkpoint (in the default case of retaining one checkpoint). This unregisters the older checkpoint without any newer checkpoint having been registered, which leads to data loss.
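The failure mode can be reproduced in a small simulation. This is a sketch under stated assumptions, not Flink's implementation: it assumes the savepoint is self-contained and therefore registers no shared incremental files, and that the next incremental checkpoint still builds on the files of the subsumed checkpoint. The function `add_completed` and all file and checkpoint names are hypothetical.

```python
from collections import Counter

refs = Counter()    # shared file -> reference count
deleted = set()     # shared files that have been physically removed
retained = []       # completed checkpoints/savepoints, oldest first
MAX_RETAINED = 1    # default case from the description: retain one checkpoint


def add_completed(name, shared_files):
    """Add a completed checkpoint or savepoint, then subsume the oldest entries."""
    for f in shared_files:
        refs[f] += 1
    retained.append((name, shared_files))
    while len(retained) > MAX_RETAINED:
        _, old_files = retained.pop(0)
        for f in old_files:
            refs[f] -= 1
            if refs[f] == 0:
                del refs[f]
                deleted.add(f)


add_completed("chk-1", ["sst-a", "sst-b"])

# Assumption: the savepoint registers no shared incremental files.
# Adding it subsumes chk-1, so sst-a and sst-b drop to zero references.
add_completed("savepoint-1", [])

# Assumption: the next incremental checkpoint still reuses chk-1's files.
chk2_files = ["sst-a", "sst-b", "sst-c"]
missing = sorted(f for f in chk2_files if f in deleted)
add_completed("chk-2", chk2_files)

print(missing)   # files chk-2 references that were already deleted
```

Here the savepoint takes the place of the "newer checkpoint" in the store without holding a reference to any shared file, so the subsumed checkpoint's data is released before anything else has registered it.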

Thanks to banmoy for reporting this problem first.

Issue Links

links to

backport to 1.12

GitHub Pull Request #14953

People

Assignee: Roman Khachatryan

Reporter: Yun Tang

Votes: 0

Watchers: 10

Dates

Created: 10/Feb/21 15:16

Updated: 03/Mar/21 21:49

Resolved: 23/Feb/21 15:59