Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Fix Version/s: 1.11.3, 1.12.2, 1.13.0, 1.14.0
Release Note: On recovery, if a failure occurs during retrieval of a checkpoint, the job is restarted (instead of skipping the checkpoint in some circumstances). This prevents potential consistency violations.
Description
DefaultCompletedCheckpointStore.recover() tries to be resilient if it cannot recover a checkpoint (e.g. due to a transient storage outage or a corrupted checkpoint). This behaviour was introduced with FLINK-7783.
The problem is that this behaviour can cause us to silently ignore the latest valid checkpoint if there is a transient problem while restoring it. This might be acceptable for at-least-once processing guarantees, but it clearly violates exactly-once processing guarantees. On top of that, such a silent fallback to an older checkpoint is very hard to spot.
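To make the failure mode concrete, here is a minimal sketch of a lenient recovery loop that skips checkpoints it cannot retrieve, as described above. This is not the actual DefaultCompletedCheckpointStore code; the CheckpointHandle and Checkpoint types and the recover() signature below are hypothetical stand-ins.

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Flink's completed-checkpoint handling.
interface CheckpointHandle {
    long checkpointId();
    Checkpoint retrieve() throws Exception; // e.g. reads checkpoint metadata from remote storage
}

class Checkpoint {
    final long id;
    Checkpoint(long id) { this.id = id; }
}

class LenientRecoverySketch {
    // Lenient variant (the behaviour introduced with FLINK-7783): a checkpoint
    // that cannot be retrieved is logged and skipped, so recovery can silently
    // fall back to an older checkpoint.
    static List<Checkpoint> recover(List<CheckpointHandle> handles) {
        List<Checkpoint> recovered = new ArrayList<>();
        for (CheckpointHandle handle : handles) {
            try {
                recovered.add(handle.retrieve());
            } catch (Exception e) {
                // Transient storage outage or corrupted checkpoint: skipped here,
                // which can violate exactly-once guarantees.
                System.err.println("Skipping checkpoint " + handle.checkpointId() + ": " + e);
            }
        }
        return recovered;
    }
}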
I propose to change this behaviour so that DefaultCompletedCheckpointStore.recover() fails if it cannot read the checkpoints it is supposed to read. If the recover method fails during a recovery, it will kill the process. The process will usually be restarted and will retry the checkpoint recovery operation; if the problem is transient, recovery should eventually succeed. If the problem occurs during the initial job submission, the job will directly transition to the FAILED state.
The proposed behaviour entails that if there is a permanent problem with a checkpoint (e.g. a corrupted checkpoint), Flink won't be able to recover without user intervention. I believe this is the right decision because Flink can no longer give exactly-once guarantees in this situation, and the user needs to resolve it explicitly.
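For contrast, a minimal sketch of the proposed fail-fast variant, reusing the hypothetical CheckpointHandle and Checkpoint types from the sketch above: any retrieval failure propagates and fails the recovery attempt as a whole, so a restarted process can retry it instead of the failure being swallowed.

import java.util.ArrayList;
import java.util.List;

class FailFastRecoverySketch {
    // Fail-fast variant (proposed change): retrieval errors are not swallowed.
    // A transient error fails this recovery attempt; the restarted process
    // retries and should eventually succeed. A permanent error (e.g. a
    // corrupted checkpoint) keeps surfacing until the user resolves it.
    static List<Checkpoint> recover(List<CheckpointHandle> handles) throws Exception {
        List<Checkpoint> recovered = new ArrayList<>();
        for (CheckpointHandle handle : handles) {
            // No try/catch around retrieve(): the exception propagates to the caller.
            recovered.add(handle.retrieve());
        }
        return recovered;
    }
}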
Issue Links
- causes: FLINK-22692 CheckpointStoreITCase.testRestartOnRecoveryFailure fails with RuntimeException (Resolved)
- is caused by: FLINK-7783 Don't always remove checkpoints in ZooKeeperCompletedCheckpointStore#recover() (Closed)
- is related to: FLINK-22494 Avoid discarding checkpoints in case of failure (Closed)