[FLINK-22494] Avoid discarding checkpoints in case of failure - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.13.0, 1.14.0, 1.12.3
Fix Version/s: 1.14.0, 1.13.1, 1.12.5
Component/s: Runtime / Checkpointing, Runtime / Coordination
Labels:
- pull-request-available

Description

Both StateHandleStore implementations (KubernetesStateHandleStore:157 and ZooKeeperStateHandleStore:170) discard the checkpoint if the checkpoint metadata wasn't written to the backend.

This does not cover the case where the data was actually written to the backend but the call failed anyway (e.g. due to network issues). In that case, the backend might end up holding a pointer to a checkpoint that has already been discarded.

Instead of discarding the checkpoint data in this specific case, we might want to keep it. Otherwise, we might run into exceptions when recovering from the checkpoint later on. We might also want to log a warning that points the user to the possibly orphaned checkpoint data.
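
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of either StateHandleStore implementation: `PointerBackend`, `WriteRejectedException`, `STORAGE` and `addCheckpoint` are hypothetical names made up for this sketch. It assumes the backend can signal a definite rejection (nothing was written) with a dedicated exception, and treats every other failure as "outcome unknown".

```java
import java.util.HashMap;
import java.util.Map;

class CheckpointStoreSketch {

    /** Hypothetical: the backend guarantees that the pointer was NOT written. */
    static class WriteRejectedException extends RuntimeException {
        WriteRejectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the metadata backend (e.g. ZooKeeper or Kubernetes). */
    interface PointerBackend {
        void writePointer(String name, String location);
    }

    /** Hypothetical in-memory stand-in for the checkpoint data storage. */
    static final Map<String, byte[]> STORAGE = new HashMap<>();

    static void addCheckpoint(PointerBackend backend, String name, byte[] data) {
        String location = "chk/" + name;
        STORAGE.put(location, data); // 1. persist the checkpoint data first
        try {
            backend.writePointer(name, location); // 2. then publish the pointer
        } catch (WriteRejectedException e) {
            // Definite failure: no pointer exists, so discarding the data is safe.
            STORAGE.remove(location);
            throw e;
        } catch (RuntimeException e) {
            // Outcome unknown (e.g. the call timed out after the write went through):
            // keep the data, because a pointer to it may exist in the backend.
            System.err.println("WARN: possibly orphaned checkpoint data at " + location);
            throw e;
        }
    }
}
```

The design choice here is to prefer a possibly orphaned file over a dangling pointer: orphaned data costs storage and can be cleaned up manually, while a pointer to discarded data fails recovery.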

Issue Links

is related to
- FLINK-25098 Jobmanager CrashLoopBackOff in HA configuration (Open)
- FLINK-25265 RUNNING to FAILED with failure cause. This might indicate that the remote task manager was lost. (Open)

relates to
- FLINK-22502 DefaultCompletedCheckpointStore drops unrecoverable checkpoints silently (Resolved)
- FLINK-24543 Zookeeper connection issue causes inconsistent state in Flink (Closed)
- FLINK-22704 ZooKeeperHaServicesTest.testCleanupJobData failed (Closed)

links to
- GitHub Pull Request #15832

People

Assignee: Matthias Pohl
Reporter: Matthias Pohl
Votes: 0
Watchers: 5

Dates

Created: 27/Apr/21 16:24
Updated: 05/Jul/22 07:03
Resolved: 18/May/21 10:43