Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-29620

Flink deployment stuck in UPGRADING state when changing configuration

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.14.2
    • 1.15.0
    • Kubernetes Operator
    • None
    • AWS EKS v1.21

      Operator version: 1.1.0

    Description

      When I update the configuration of a flink deployment I observe one of two scenarios:

      Success:

      This happens when the job has not started - if I change the configuration quick enough:

      2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
      2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
      2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
      2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
      2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
      2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
      2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
      2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
      2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
       

      In this scenario I see that the job manager and task manager pods are terminated and then recreated.

       

       

      Failure:

      This happens when I let the job start (wait more than 30-60 seconds) and change the configuration:

      2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
      2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
      2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
      2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
      2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
      2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
      2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
      2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
      2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
      2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
      2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
      2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
      2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
      2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
      2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
      2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
      2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
      2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
      2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
      2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
      2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
      2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService       [INFO ][load-streaming/validator-process-124] Deploying application cluster
      2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils         [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
      2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
      2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
      2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
      2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
      org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
          at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
          at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
          at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
          at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
          at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
          at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
          at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
          at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
          at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
          at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
          at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.base/java.lang.Thread.run(Unknown Source)
      Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
          at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
          at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
          at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
          at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
          at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
          ... 13 more 

      In this scenario I see that the job manager pod is restarted (not recreated), task manager pods are not updated, flink config maps are not updated.

      The flink deployment state changes to UPGRADING and the above exception is repeated.

      error in flink deployment: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
      Job Manager Deployment Status:  MISSING

       

      Flink deployment spec:

       

      flinkVersion: v1_14 
      job:
          allowNonRestoredState: true
          args: ...
          entryClass: ...
          jarURI: ...
          parallelism: x
          savepointTriggerNonce: 0
          state: running
          upgradeMode: savepoint
      jobManager:
          podTemplate:
            apiVersion: v1
            kind: Pod
            metadata:
              annotations:
                configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: nodeType
                        operator: In
                        values:
                        - someValue
              containers:
              - name: flink-main-container
                resources:
                  limits:
                    cpu: "1"
                    memory: 1.6Gi
                  requests:
                    cpu: "0.2"
                    memory: 1Gi
              tolerations:
              - effect: NoSchedule
                key: someValue
                value: "true"
          replicas: 1
      podTemplate:
          apiVersion: v1
          kind: Pod
          metadata:
            annotations:
              configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
              prometheus.io/path: /metrics
              prometheus.io/port: "9260"
              prometheus.io/scrape: "true"
            labels:
              app.kubernetes.io/instance: flink-validator-process-124
              app.kubernetes.io/managed-by: Helm
              app.kubernetes.io/name: apache-flink
              app.kubernetes.io/version: test
              helm.sh/chart: apache-flink-1.0.0
          spec:
            containers: []
            imagePullSecrets: []
        serviceAccount: validator-process-124
        taskManager:
          podTemplate:
            apiVersion: v1
            kind: Pod
            metadata:
              annotations:
                configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: nodeType
                        operator: In
                        values:
                        - someValue
              containers:
              - name: flink-main-container
                resources:
                  limits:
                    cpu: "1"
                    memory: 3.6Gi
                  requests:
                    cpu: "0.2"
                    memory: 3Gi
              tolerations:
              - effect: NoSchedule
                key: someValue
                value: "true"

       

      Please let me know if more details are required.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            liadsh liad shachoach
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: