Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-21979

Job can be restarted from the beginning after it reached a terminal state

    XMLWordPrintableJSON

Details

    Description

      Currently, the JobMaster removes all checkpoints after a job reaches a globally terminal state. Then it notifies the Dispatcher about the termination of the job. The Dispatcher then removes the job from the SubmittedJobGraphStore. If the Dispatcher process fails before doing that it might get restarted. In this case, the Dispatcher would still find the job in the SubmittedJobGraphStore and recover it. Since the CompletedCheckpointStore is empty, it would start executing this job from the beginning.

      I think we must not remove job state before the job has not been marked as done or made inaccessible for any restarted processes. Concretely, we should first remove the job from the SubmittedJobGraphStore and only then delete the checkpoints. Ideally all the job related cleanup operation happens atomically.

      Attachments

        Issue Links

          Activity

            People

              dmvk David Morávek
              trohrmann Till Rohrmann
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: