Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.17.1, kubernetes-operator-1.6.0
Flink: 1.17.1
Flink Kubernetes Operator: 1.6.0
Description
We encountered a problem where the operator unexpectedly deleted HA data.
The timeline is as follows:
12:08 We submitted the first spec, which suspended the job with savepoint upgrade mode.
12:08 The job was suspended, while the HA data was preserved, and the log showed the observed job deployment status was MISSING.
12:10 We submitted the second spec, which deployed the job with the last state upgrade mode.
12:10 Logs showed that the operator deleted the Flink deployment again, and this time the HA data as well.
12:10 The job failed to start because the HA data was missing.
According to the log, the deletion was triggered by https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168
I think this code path should only be reached when the job deployment status is not MISSING, but the log entry just before the deletion shows that the observed status was MISSING at that moment.
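To make the expectation concrete, here is a minimal model of the guard as I read it. It is only a sketch of my understanding, not the operator's actual code: the function name `should_delete_before_deploy` is hypothetical, and `READY` is included purely as a contrasting status.

```python
from enum import Enum


class JobManagerDeploymentStatus(Enum):
    # Only MISSING appears in this report; READY is here purely for contrast.
    MISSING = "MISSING"
    READY = "READY"


def should_delete_before_deploy(observed: JobManagerDeploymentStatus) -> bool:
    """Expected guard (assumption): clean up only if a deployment was observed."""
    return observed is not JobManagerDeploymentStatus.MISSING


# At 12:10 the observer logged "Previous status: MISSING", so under this
# reading no deletion (and no HA metadata cleanup) should have been triggered.
print(should_delete_before_deploy(JobManagerDeploymentStatus.MISSING))
```

If this reading is right, the deletion at 12:10:27.679 must have come from a different branch, or the status seen by the reconciler differed from the one the observer logged.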
Related logs:
2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils [INFO ][default/pipeline-pipeline-se-3] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
A more complete log file is attached. Thanks.
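For anyone scanning a longer operator log for the same pattern, a small script along these lines may help. It is a rough sketch that assumes the log format shown above (entries starting with a `YYYY-MM-DD HH:MM:SS.mmm +ZZZZ` timestamp) and matches only the two message strings quoted in this report; the helper names `split_entries` and `find_suspicious_deletions` are made up for illustration.

```python
import re

# Each entry starts with a timestamp such as "2023-08-30 12:10:27.679 +0000".
# Splitting on a lookahead also works when entries are run together on one line.
ENTRY_START = re.compile(
    r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{4} )"
)
STATUS = re.compile(r"Previous status: (\w+)")
DELETE_MSG = "Deleting JobManager deployment and HA metadata"


def split_entries(text):
    """Split raw log text into individual entries, dropping empty pieces."""
    return [e.strip() for e in ENTRY_START.split(text) if e.strip()]


def find_suspicious_deletions(text):
    """Return timestamps of HA-metadata deletions that follow a MISSING status."""
    last_status = None
    hits = []
    for entry in split_entries(text):
        match = STATUS.search(entry)
        if match:
            last_status = match.group(1)
        elif DELETE_MSG in entry and last_status == "MISSING":
            hits.append(entry[:23])  # "YYYY-MM-DD HH:MM:SS.mmm"
    return hits


sample = (
    "2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver "
    "[INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. "
    "Previous status: MISSING "
    "2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService "
    "[INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment "
    "and HA metadata."
)
print(find_suspicious_deletions(sample))
```

The script does not separate entries by resource name, so a log covering several deployments would need to be filtered on the `[namespace/name]` field first.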