FLINK-33011

Operator deletes HA data unexpectedly

Details

    Description

      We encountered a problem where the operator unexpectedly deleted HA data.

      The timeline is as follows:

      12:08 We submitted the first spec, which suspended the job with savepoint upgrade mode.

      12:08 The job was suspended and the HA data was preserved. The log showed the observed JobManager deployment status as MISSING.

      12:10 We submitted the second spec, which deployed the job with the last state upgrade mode.

      12:10 Logs showed the operator deleted both the Flink deployment and the HA data again.

      12:10 The job failed to start because the HA data was missing.

      According to the log, the deletion was triggered by https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168

      As far as I can tell, this code path should only be reached when the JobManager deployment status is not MISSING. However, the log line just before the deletion shows that the observed status was MISSING at that moment.
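
      The expectation above can be written down as a small, self-contained Java sketch. It is not the operator's code: `DeletionGuardSketch`, `DeploymentStatus`, and `shouldDeleteBeforeDeploy` are hypothetical names, and the sketch assumes the deletion is conditioned only on the observed JobManager deployment status.

```java
// Hypothetical sketch; none of these names come from flink-kubernetes-operator.
final class DeletionGuardSketch {

    /** Simplified stand-in for the observed JobManager deployment status. */
    enum DeploymentStatus { MISSING, DEPLOYING, READY, ERROR }

    /**
     * Expected guard (an assumption): an existing cluster is torn down before
     * a new deployment only if a JobManager deployment was actually observed.
     */
    static boolean shouldDeleteBeforeDeploy(DeploymentStatus observed) {
        return observed != DeploymentStatus.MISSING;
    }

    public static void main(String[] args) {
        // Print the expected decision for every status value.
        for (DeploymentStatus s : DeploymentStatus.values()) {
            System.out.println(s + " -> delete before deploy: " + shouldDeleteBeforeDeploy(s));
        }
    }
}
```

      Under this assumption, an observed status of MISSING should skip the deletion entirely, which is why the "Deleting JobManager deployment and HA metadata" log line is surprising.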

      Related logs:

       

      2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
      2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
      2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils         [INFO ][default/pipeline-pipeline-se-3] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
      2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
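
      To find further occurrences of this pattern in a longer log, a scan along the following lines may help. This is a minimal sketch: `HaDeletionLogScan` and `suspiciousDeletions` are hypothetical names, lines are matched by plain substring so no particular file layout is assumed, and the input is assumed to contain the lines of a single resource in chronological order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting operator logs; not part of the operator.
final class HaDeletionLogScan {

    static final String OBSERVING = "Observing JobManager deployment";
    static final String OBSERVED_MISSING = "Previous status: MISSING";
    static final String HA_DELETION = "Deleting JobManager deployment and HA metadata";

    /**
     * Returns the indices of HA deletion lines whose most recent preceding
     * observation line reported a previous status of MISSING.
     */
    static List<Integer> suspiciousDeletions(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        boolean lastObservedMissing = false;
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.contains(OBSERVING)) {
                // Remember what the latest observation reported.
                lastObservedMissing = line.contains(OBSERVED_MISSING);
            } else if (line.contains(HA_DELETION) && lastObservedMissing) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two of the log lines quoted in this report.
        List<String> sample = List.of(
            "2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING",
            "2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.");
        System.out.println(suspiciousDeletions(sample));
    }
}
```
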
      

      A more complete log file is attached. Thanks.

      Attachments

        1. flink_operator_logs_0831.csv
          52 kB
          Ruibin Xing

          People

            gyfora Gyula Fora
            ruibin Ruibin Xing