[FLINK-21979] Job can be restarted from the beginning after it reached a terminal state - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.11.3, 1.12.2, 1.13.0
Fix Version/s: 1.14.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

Currently, the JobMaster removes all checkpoints after a job reaches a globally terminal state and then notifies the Dispatcher that the job has terminated. The Dispatcher then removes the job from the SubmittedJobGraphStore. If the Dispatcher process fails before doing so, it may be restarted. In that case, the restarted Dispatcher still finds the job in the SubmittedJobGraphStore and recovers it. Since the CompletedCheckpointStore is empty, it starts executing the job from the beginning.

I think we must not remove job state before the job has been marked as done or made inaccessible to any restarted processes. Concretely, we should first remove the job from the SubmittedJobGraphStore and only then delete the checkpoints. Ideally, all job-related cleanup operations would happen atomically.
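
To illustrate why the ordering matters, here is a minimal in-memory sketch. It is not Flink code: `CleanupOrderSketch`, its fields, and its methods are hypothetical stand-ins for the SubmittedJobGraphStore and the CompletedCheckpointStore, and the `crashBetweenSteps` flag simulates the Dispatcher process failing between the two cleanup steps.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of the two stores; none of this is a Flink API. */
final class CleanupOrderSketch {
    // Stand-in for the SubmittedJobGraphStore: jobs a restarted Dispatcher would recover.
    final Set<String> jobGraphStore = new HashSet<>();
    // Stand-in for the CompletedCheckpointStore of a single job.
    final List<String> checkpoints = new ArrayList<>();

    CleanupOrderSketch(String jobId) {
        jobGraphStore.add(jobId);
        checkpoints.add("chk-1");
        checkpoints.add("chk-2");
    }

    /** Ordering described as current behavior: checkpoints first, job graph second. */
    void cleanupCheckpointsFirst(String jobId, boolean crashBetweenSteps) {
        checkpoints.clear();
        if (crashBetweenSteps) {
            return; // simulated process failure: the job graph is never removed
        }
        jobGraphStore.remove(jobId);
    }

    /** Proposed ordering: job graph first, checkpoints second. */
    void cleanupJobGraphFirst(String jobId, boolean crashBetweenSteps) {
        jobGraphStore.remove(jobId);
        if (crashBetweenSteps) {
            return; // simulated process failure: the checkpoints are never deleted
        }
        checkpoints.clear();
    }

    /** True if a restarted process would recover the job but find no checkpoint to restore. */
    boolean wouldRestartFromBeginning(String jobId) {
        return jobGraphStore.contains(jobId) && checkpoints.isEmpty();
    }
}
```

In this model, a failure between the two steps of `cleanupCheckpointsFirst` leaves a recoverable job with no checkpoints, which is the restart-from-the-beginning case described above. With `cleanupJobGraphFirst`, the same failure leaves orphaned checkpoints behind but no job to recover. The proposed ordering therefore trades a possible re-execution for a possible leak of checkpoint data, which is why atomic cleanup would still be preferable.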

Issue Links

is related to
- FLINK-10333 Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints) (Open)
- FLINK-21928 DuplicateJobSubmissionException after JobManager failover (Closed)
- FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)

relates to
- FLINK-19816 Flink restored from a wrong checkpoint (a very old one and not the last completed one) (Closed)

links to
- GitHub Pull Request #16535

People

Assignee: David Morávek
Reporter: Till Rohrmann
Votes: 0
Watchers: 8

Dates

Created: 25/Mar/21 17:34
Updated: 02/Sep/21 12:57
Resolved: 09/Aug/21 14:40