Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 1.10.3, 1.11.3, 1.12.2, 1.13.0
- Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, High Availability enabled
Description
Consider the following scenario:
- Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, high availability enabled
- Flink job reaches a globally terminal state
- Flink job is marked as finished in the high-availability service's RunningJobsRegistry
- The JobManager fails over
On recovery, the Dispatcher throws DuplicateJobSubmissionException, because the job is marked as done in the RunningJobsRegistry.
When this happens, users cannot get out of the situation without manually redeploying the JobManager process and changing the job ID^1^.
The desired semantics are that we do not want to re-execute a job that has reached a globally terminal state. In this particular case, we know that the job has already reached such a state (because it has been marked as done in the registry). Therefore, we could handle this case by executing the regular termination sequence instead of throwing a DuplicateJobSubmissionException.
—
1 With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, there is, as far as I know, no notion of ephemeral data that is tied to a session in the first place.
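The proposed handling can be sketched as follows. This is an illustrative model only: the class, enum, and method names (`DispatcherSketch`, `JobSchedulingStatus`, `SubmissionOutcome`, `submit`) are hypothetical and simplified, and do not match Flink's actual Dispatcher or RunningJobsRegistry APIs. It shows the decision change: when the registry reports the job as done, run the termination sequence instead of failing the submission.

```java
import java.util.HashMap;
import java.util.Map;

public class DispatcherSketch {

    // Mirrors the states a job can have in a running-jobs registry.
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    enum SubmissionOutcome { EXECUTE, REJECT_DUPLICATE, RUN_TERMINATION_SEQUENCE }

    private final Map<String, JobSchedulingStatus> registry = new HashMap<>();

    void markStatus(String jobId, JobSchedulingStatus status) {
        registry.put(jobId, status);
    }

    // Proposed handling: when a fixed job ID is resubmitted after a
    // JobManager failover and the registry already says DONE, do not throw
    // DuplicateJobSubmissionException; run the regular termination sequence
    // so the process can shut down cleanly.
    SubmissionOutcome submit(String jobId) {
        JobSchedulingStatus status = registry.get(jobId);
        if (status == null) {
            // First submission of this job ID: execute it.
            return SubmissionOutcome.EXECUTE;
        }
        if (status == JobSchedulingStatus.DONE) {
            // Job already reached a globally terminal state: do not
            // re-execute, but also do not fail; terminate gracefully.
            return SubmissionOutcome.RUN_TERMINATION_SEQUENCE;
        }
        // Job is still pending or running: a genuine duplicate submission.
        return SubmissionOutcome.REJECT_DUPLICATE;
    }

    public static void main(String[] args) {
        DispatcherSketch d = new DispatcherSketch();
        System.out.println(d.submit("job-1"));
        d.markStatus("job-1", JobSchedulingStatus.DONE);
        // After failover, the fixed job ID is resubmitted:
        System.out.println(d.submit("job-1"));
    }
}
```

The key point is that the DONE branch replaces an exception with a normal shutdown path; pending or running jobs are still rejected as duplicates, so the "do not re-execute a terminal job" semantics are preserved.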
Issue Links
- is duplicated by
  - FLINK-26333 Repeated exception with DuplicateJobSubmissionException: Job has already been submitted (Open)
- is related to
  - FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)
- relates to
  - FLINK-21979 Job can be restarted from the beginning after it reached a terminal state (Closed)
  - FLINK-21980 ZooKeeperRunningJobsRegistry creates an empty znode (Closed)