  1. Flink
  2. FLINK-4911 Non-disruptive JobManager Failures via Reconciliation
  3. FLINK-5501

Determine whether the job starts from last JobManager failure

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      When the JobManagerRunner is granted leadership, it should check whether the current job is already running. If it is, the JobManager should reconcile itself (enter the RECONCILING state) and wait for the TaskManagers to report their task status. Otherwise, the JobManager can schedule the ExecutionGraph in the usual way.

      The RunningJobsRegistry can provide a way to check whether the job is running, but the current interface must be extended and the related process fixed to support this.

      1. The RunningJobsRegistry sets the job status to RUNNING the first time the JobManagerRunner is granted leadership.

      2. When the job finishes, the RunningJobsRegistry sets the job status to FINISHED, and the status is deleted before exit.

      3. If the mini cluster starts multiple JobManagerRunners and the leading JobManagerRunner has already finished the job and set its status to FINISHED, any other JobManagerRunner exits when it is subsequently granted leadership.

      4. If the JobManager fails, the job status remains RUNNING. When a JobManagerRunner (the previous one or a new one) is granted leadership again, it checks the job status and enters the RECONCILING state.
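
      The four steps above can be sketched roughly as follows. This is only an illustrative model, not Flink's actual code: the class LeadershipSketch, the InMemoryRegistry stand-in, the ABSENT status, the Action enum, and all method names are assumptions made for this example. A real RunningJobsRegistry would be backed by highly available storage rather than an in-process map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the flow described in this issue; not Flink's actual classes. */
class LeadershipSketch {

    /** Status kept per job. ABSENT means no entry exists (never started, or already deleted). */
    enum JobStatus { ABSENT, RUNNING, FINISHED }

    /** What a runner does after it has been granted leadership. */
    enum Action { SCHEDULE, RECONCILE, EXIT }

    /** In-memory stand-in (assumption) for a highly available RunningJobsRegistry. */
    static final class InMemoryRegistry {
        private final Map<String, JobStatus> statuses = new ConcurrentHashMap<>();

        JobStatus getStatus(String jobId) {
            return statuses.getOrDefault(jobId, JobStatus.ABSENT);
        }

        void setRunning(String jobId)  { statuses.put(jobId, JobStatus.RUNNING); }
        void setFinished(String jobId) { statuses.put(jobId, JobStatus.FINISHED); }
        void clear(String jobId)       { statuses.remove(jobId); }
    }

    /** Decides the action for a runner that has just been granted leadership. */
    static Action onLeadershipGranted(InMemoryRegistry registry, String jobId) {
        switch (registry.getStatus(jobId)) {
            case RUNNING:
                // Step 4: a previous leader failed mid-job, so wait for TaskManager reports.
                return Action.RECONCILE;
            case FINISHED:
                // Step 3: another runner already completed the job.
                return Action.EXIT;
            default:
                // Step 1: first leadership for this job; mark it RUNNING and schedule normally.
                registry.setRunning(jobId);
                return Action.SCHEDULE;
        }
    }

    /** Step 2: record completion; the entry is deleted later via clear(), before exit. */
    static void onJobFinished(InMemoryRegistry registry, String jobId) {
        registry.setFinished(jobId);
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        System.out.println(onLeadershipGranted(registry, "job-1")); // first grant
        System.out.println(onLeadershipGranted(registry, "job-1")); // grant after a JobManager failure
        onJobFinished(registry, "job-1");
        System.out.println(onLeadershipGranted(registry, "job-1")); // standby runner after completion
    }
}
```

      In this sketch the status check and the setRunning call are not atomic. It relies on the assumption that leader election allows only one runner to act on a job at a time.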

            People

              tiemsn shuai.xu
              zjwang Zhijiang