Details
Bug
Status: Resolved
Major
Resolution: Fixed
1.14.2
None
AWS EKS v1.21
Operator version: 1.1.0
Description
When I update the configuration of a Flink deployment, I observe one of two scenarios:
Success:
This happens when the job has not started yet, that is, if I change the configuration quickly enough:
2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
In this scenario I see that the job manager and task manager pods are terminated and then recreated.
Failure:
This happens when I let the job start (wait more than 30-60 seconds) and change the configuration:
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Deploying application cluster
2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
    at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
    at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
    at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
    at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
    ... 13 more
In this scenario I see that the job manager pod is restarted (not recreated), the task manager pods are not updated, and the Flink config maps are not updated.
The Flink deployment state changes to UPGRADING and the above exception is repeated on every reconciliation.
Error in Flink deployment: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
Job Manager Deployment Status: MISSING
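To make the two runs easier to compare, the shutdown wait can be pulled out of the operator log with a short script. This is only a rough sketch: the regular expressions are inferred from the log lines quoted above, and the function name `summarize_shutdown` is hypothetical, not part of the operator.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the log excerpts in this report:
# <timestamp> <logger> [<LEVEL>][<namespace>/<name>] <message>
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<logger>\S+)\s+\[(?P<level>\w+)\s*\]\[(?P<resource>[^\]]+)\] "
    r"(?P<msg>.*)$"
)
WAIT = re.compile(r"Waiting for cluster shutdown\.\.\. \((\d+)s\)")


def summarize_shutdown(lines):
    """Return (longest reported wait in seconds, time of 'Cluster shutdown completed')."""
    max_wait = 0
    completed_at = None
    for line in lines:
        entry = ENTRY.match(line.strip())
        if entry is None:
            continue  # stack-trace frames and other non-entry lines
        waited = WAIT.search(entry["msg"])
        if waited:
            max_wait = max(max_wait, int(waited.group(1)))
        elif entry["msg"].startswith("Cluster shutdown completed"):
            completed_at = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return max_wait, completed_at
```

Run over the two excerpts above, the longest reported wait is 10s in the successful run and 60s in the failing one, where "Cluster shutdown completed" is logged one second after the 60s message.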
Flink deployment spec:
flinkVersion: v1_14
job:
  allowNonRestoredState: true
  args: ...
  entryClass: ...
  jarURI: ...
  parallelism: x
  savepointTriggerNonce: 0
  state: running
  upgradeMode: savepoint
jobManager:
  podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeType
                operator: In
                values:
                - someValue
      containers:
      - name: flink-main-container
        resources:
          limits:
            cpu: "1"
            memory: 1.6Gi
          requests:
            cpu: "0.2"
            memory: 1Gi
      tolerations:
      - effect: NoSchedule
        key: someValue
        value: "true"
  replicas: 1
podTemplate:
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
      prometheus.io/path: /metrics
      prometheus.io/port: "9260"
      prometheus.io/scrape: "true"
    labels:
      app.kubernetes.io/instance: flink-validator-process-124
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: apache-flink
      app.kubernetes.io/version: test
      helm.sh/chart: apache-flink-1.0.0
  spec:
    containers: []
    imagePullSecrets: []
serviceAccount: validator-process-124
taskManager:
  podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeType
                operator: In
                values:
                - someValue
      containers:
      - name: flink-main-container
        resources:
          limits:
            cpu: "1"
            memory: 3.6Gi
          requests:
            cpu: "0.2"
            memory: 3Gi
      tolerations:
      - effect: NoSchedule
        key: someValue
        value: "true"
Please let me know if more details are required.