Flink / FLINK-21928

DuplicateJobSubmissionException after JobManager failover


Details

    The fix for this problem only works if the ApplicationMode is used with a single job submission and if the user code does not access the `JobExecutionResult`. If either of these conditions is violated, then Flink cannot guarantee that the whole Flink application is executed.

    Additionally, the user still needs to clean up the corresponding HA entries for the running jobs registry, because these entries won't be reliably cleaned up when encountering the situation described by FLINK-21928.
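
    With ZooKeeper-based HA, such a cleanup boils down to deleting the registry znode for the affected job. The sketch below is a minimal, hypothetical example using Apache Curator; the quorum address, the job ID, and the znode path are assumptions (the layout `<ha-root>/<cluster-id>/running_job_registry/<job-id>` varies across Flink versions), so verify the actual path, e.g. with `zkCli.sh ls`, before deleting anything.

    ```java
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class RunningJobsRegistryCleanup {

        public static void main(String[] args) throws Exception {
            // Assumed quorum address and znode path; both are deployment- and
            // version-specific. Check the real layout before deleting.
            String zkQuorum = "zk-1:2181,zk-2:2181,zk-3:2181";
            String entry = "/flink/default/running_job_registry/00000000000000000000000000000000";

            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    zkQuorum, new ExponentialBackoffRetry(1000, 3));
            client.start();
            try {
                if (client.checkExists().forPath(entry) != null) {
                    client.delete().forPath(entry);
                    System.out.println("Deleted " + entry);
                } else {
                    System.out.println("No registry entry at " + entry);
                }
            } finally {
                client.close();
            }
        }
    }
    ```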

    Description

      Consider the following scenario:

      • Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, high availability enabled
      • Flink job reaches a globally terminal state
      • Flink job is marked as finished in the high-availability service's RunningJobsRegistry
      • The JobManager fails over

      On recovery, the Dispatcher throws DuplicateJobSubmissionException, because the job is marked as done in the RunningJobsRegistry.
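
      Conceptually, the rejection comes from a status lookup in the RunningJobsRegistry during submission. The following is a simplified sketch of that check, not the actual Dispatcher code: `RunningJobsRegistry` and its `JobSchedulingStatus` are real Flink interfaces, while the surrounding guard class is invented for illustration.

      ```java
      import java.io.IOException;

      import org.apache.flink.api.common.JobID;
      import org.apache.flink.runtime.highavailability.RunningJobsRegistry;
      import org.apache.flink.runtime.highavailability.RunningJobsRegistry.JobSchedulingStatus;

      /** Hypothetical guard; in Flink the equivalent check happens in the Dispatcher. */
      final class SubmissionGuard {

          private final RunningJobsRegistry registry;

          SubmissionGuard(RunningJobsRegistry registry) {
              this.registry = registry;
          }

          void checkNotAlreadyDone(JobID jobId) throws IOException {
              // After the failover the job is re-submitted under the same fixed
              // job ID. The registry still reports DONE, so the submission is
              // rejected, surfacing as DuplicateJobSubmissionException.
              if (registry.getJobSchedulingStatus(jobId) == JobSchedulingStatus.DONE) {
                  throw new IllegalStateException(
                          "Job " + jobId + " has already been submitted and finished");
              }
          }
      }
      ```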

      When this happens, users cannot get out of the situation without manually redeploying the JobManager process and changing the job ID [1].

      The desired semantics are that we don't re-execute a job that has already reached a globally terminal state. In this particular case, we know that the job has reached such a state (it has been marked as finished in the registry). Therefore, we could handle this case by executing the regular termination sequence instead of throwing a DuplicateJobSubmissionException.
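
      A hedged sketch of that idea follows: treat a DONE registry entry for the submitted job ID as "job already finished" and run the normal termination sequence instead of failing the submission. Apart from the registry lookup, all names here are hypothetical.

      ```java
      import java.io.IOException;

      import org.apache.flink.api.common.JobID;
      import org.apache.flink.runtime.highavailability.RunningJobsRegistry;
      import org.apache.flink.runtime.highavailability.RunningJobsRegistry.JobSchedulingStatus;

      /** Hypothetical recovery-aware submission path. */
      final class RecoveryAwareSubmission {

          private final RunningJobsRegistry registry;

          RecoveryAwareSubmission(RunningJobsRegistry registry) {
              this.registry = registry;
          }

          /**
           * Returns true if the job was (re-)submitted, false if it had already
           * reached a globally terminal state and the regular termination
           * sequence was triggered instead of throwing a
           * DuplicateJobSubmissionException.
           */
          boolean submitOrFinish(JobID jobId, Runnable submitJob, Runnable terminate)
                  throws IOException {
              if (registry.getJobSchedulingStatus(jobId) == JobSchedulingStatus.DONE) {
                  // The job finished before the failover; don't re-execute it,
                  // just shut the application cluster down cleanly.
                  terminate.run();
                  return false;
              }
              submitJob.run();
              return true;
          }
      }
      ```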

      [1] With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, there is no notion of ephemeral data that is tied to a session in the first place, as far as I know.

People

Assignee: David Morávek (dmvk)
Reporter: Ufuk Celebi (uce)
