FLINK-29566

Reschedule the cleanup logic if canceling the job fails

Details

    Description

      Currently, when we remove a FlinkSessionJob object, we always delete the object, even if the Flink job was not canceled successfully.

      This is semantically inconsistent: the FlinkSessionJob has been removed, but the Flink job is still running.

       

      One scenario is a FlinkDeployment deployed in HA mode.

      If we remove the FlinkSessionJob and change the FlinkDeployment at the same time, or if the TMs are restarting because of a bug such as an OOM, the cancellation of the Flink job fails because the TMs are not available.

       

      We should reschedule the cleanup logic if the FlinkDeployment is present.

      We can also add a new ReconciliationState, DELETING, to indicate the FlinkSessionJob's status.

       

      The logic will be:

      if the FlinkDeployment is not present
          delete the FlinkSessionJob object
      else
          if the JM is not available
              reschedule
          else
              if the job is canceled successfully
                  delete the FlinkSessionJob object
              else
                  reschedule
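
The decision tree above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class `SessionJobCleanupSketch`, the enum `CleanupAction`, and the method `decide` are hypothetical names and are not part of the operator's existing API. The cancel attempt is passed in as a callback so that it runs only when the FlinkDeployment is present and the JM is available.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the proposed cleanup decision; not the operator's real API. */
final class SessionJobCleanupSketch {

    /** What the cleanup step should do after one attempt. */
    enum CleanupAction {
        DELETE_RESOURCE, // remove the FlinkSessionJob object
        RESCHEDULE       // keep the object and retry the cleanup later
    }

    /**
     * @param deploymentPresent whether the FlinkDeployment still exists
     * @param jobManagerAvailable whether the JM can currently be reached
     * @param cancelJob assumed callback that tries to cancel the job and returns true on success
     */
    static CleanupAction decide(
            boolean deploymentPresent,
            boolean jobManagerAvailable,
            BooleanSupplier cancelJob) {
        if (!deploymentPresent) {
            // Nothing is left to cancel against, so the object can be removed.
            return CleanupAction.DELETE_RESOURCE;
        }
        if (!jobManagerAvailable) {
            // Cancellation cannot be attempted right now.
            return CleanupAction.RESCHEDULE;
        }
        // Only here is the cancel call actually made.
        return cancelJob.getAsBoolean()
                ? CleanupAction.DELETE_RESOURCE
                : CleanupAction.RESCHEDULE;
    }

    private SessionJobCleanupSketch() {}
}
```

For example, `SessionJobCleanupSketch.decide(true, false, () -> true)` yields `RESCHEDULE` without invoking the callback, which matches the "JM is not available" branch.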

      When we cancel the Flink job, we need to verify that all jobs with the same name have been canceled, in case the job ID changed after the JM restarted.
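
One possible way to do the same-name check is sketched below. It assumes the caller has already obtained a list of jobs with their ID, name, and state (for instance from the job overview of the Flink REST API); the `JobInfo` class and the `jobIdsStillToCancel` method are hypothetical names used only for illustration. A job counts as done when it is in one of Flink's globally terminal states (FINISHED, CANCELED, FAILED).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: find jobs with a given name that still need to be canceled. */
final class SameNameJobCheckSketch {

    /** Minimal, assumed view of one job as reported by the cluster. */
    static final class JobInfo {
        final String jobId;
        final String name;
        final String state;

        JobInfo(String jobId, String name, String state) {
            this.jobId = jobId;
            this.name = name;
            this.state = state;
        }
    }

    // Globally terminal job states; a job in one of these no longer runs.
    private static final Set<String> TERMINAL_STATES =
            new HashSet<>(Arrays.asList("FINISHED", "CANCELED", "FAILED"));

    /**
     * Returns the IDs of all jobs named jobName that are not yet in a terminal state.
     * Matching by name rather than by ID covers the case where the job ID changed
     * after a JM restart.
     */
    static List<String> jobIdsStillToCancel(List<JobInfo> jobs, String jobName) {
        List<String> ids = new ArrayList<>();
        for (JobInfo job : jobs) {
            if (job.name.equals(jobName) && !TERMINAL_STATES.contains(job.state)) {
                ids.add(job.jobId);
            }
        }
        return ids;
    }

    private SameNameJobCheckSketch() {}
}
```

Under this sketch, the FlinkSessionJob object would be deleted only once the returned list is empty; otherwise the remaining IDs are canceled and the cleanup is rescheduled.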

       

       

            People

              haoxin Xin Hao