FLINK-11813

Standby per job mode Dispatchers don't know job's JobSchedulingStatus

Details

      Re-submission of a job in Application Mode when the job finished but failed during cleanup is fixed by the new JobResultStore component, which enables Flink to persist the cleanup state of a job to the file system (see FLINK-25431).

    Description

      At the moment, standby Dispatchers in per-job mode can restart a terminated job after they gain leadership. The problem is that we currently clear the RunningJobsRegistry once a job has reached a globally terminal state. After the leading Dispatcher terminates, a standby Dispatcher gains leadership. Without the information from the RunningJobsRegistry, it cannot tell whether the job has already been executed or whether it needs to re-execute the job. At the moment, the Dispatcher assumes that there was a fault and re-executes the job, which can lead to duplicate results.
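
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch: the class `StandbyDispatcherSketch`, its nested `Registry`, the `Status` enum, and the method `shouldExecute` are hypothetical stand-ins, not Flink's actual classes, and the assumption that a missing entry reads as "not yet run" is a simplification of the behaviour described in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the failure mode; none of these types are Flink's real classes.
class StandbyDispatcherSketch {

    enum Status { PENDING, RUNNING, DONE }

    // Stand-in for the shared registry (kept in ZooKeeper in an HA setup).
    static final class Registry {
        private final Map<String, Status> entries = new HashMap<>();

        void set(String jobId, Status status) {
            entries.put(jobId, status);
        }

        // Assumption: a missing entry looks the same as a job that never started.
        Status get(String jobId) {
            return entries.getOrDefault(jobId, Status.PENDING);
        }

        void clear(String jobId) {
            entries.remove(jobId);
        }
    }

    // Decision a newly elected leader makes for the job it was started with.
    static boolean shouldExecute(Registry registry, String jobId) {
        return registry.get(jobId) != Status.DONE;
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        String jobId = "job-1";

        // The leading Dispatcher runs the job to a globally terminal state ...
        registry.set(jobId, Status.RUNNING);
        registry.set(jobId, Status.DONE);
        // ... and then cleans up the registry entry before terminating.
        registry.clear(jobId);

        // The standby Dispatcher gains leadership and sees no trace of the job.
        System.out.println("standby re-executes: " + shouldExecute(registry, jobId));
    }
}
```

In this model the program prints `standby re-executes: true`, because the cleared entry can no longer be told apart from a job that was never run.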

      I think we need some way to tell standby Dispatchers that a certain job has been successfully executed. One trivial solution would be to not clean up the RunningJobsRegistry, but then we would clutter ZooKeeper.
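
One way to picture "telling standby Dispatchers that a job has finished" is a small completion marker that outlives the registry cleanup, for example a file on a shared file system. The sketch below is illustrative only: `FileResultMarkerSketch`, `markFinished`, `hasFinished`, `reExecutesAfterFailover`, and the `.done` file naming are all hypothetical, and it does not reproduce the API of Flink's JobResultStore. It uses a local temporary directory in place of shared storage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical completion-marker store; not the API of Flink's JobResultStore.
class FileResultMarkerSketch {

    private final Path dir;

    FileResultMarkerSketch(Path dir) {
        this.dir = dir;
    }

    // Called by the leader once the job has reached a globally terminal state.
    void markFinished(String jobId) {
        try {
            Files.createDirectories(dir);
            Files.writeString(dir.resolve(jobId + ".done"), "FINISHED");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called by a new leader before deciding whether to run the job.
    boolean hasFinished(String jobId) {
        return Files.exists(dir.resolve(jobId + ".done"));
    }

    // Simulates a failover: one instance writes the marker, a fresh one reads it.
    static boolean reExecutesAfterFailover(String jobId) {
        try {
            // Assumption: a temp directory stands in for shared, durable storage.
            Path shared = Files.createTempDirectory("job-results");
            new FileResultMarkerSketch(shared).markFinished(jobId);
            FileResultMarkerSketch newLeaderView = new FileResultMarkerSketch(shared);
            return !newLeaderView.hasFinished(jobId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("re-execute job-1: " + reExecutesAfterFailover("job-1"));
    }
}
```

Here the new leader finds the marker and the program prints `re-execute job-1: false`. The trade-off mentioned above still applies in this sketch: the markers accumulate until something removes them, only now on the file system rather than in ZooKeeper.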

            People

              mapohl Matthias Pohl
              trohrmann Till Rohrmann
