FLINK-32774

Reconciliation for autoscaling overrides gets stuck after cancel-with-savepoint


Details

    Description

      Since https://issues.apache.org/jira/browse/FLINK-32589, the operator no longer relies on the Flink configuration to store the parallelism overrides. Instead, it stores them internally in the autoscaler config map. When scaling without the rescaling API, the operator changes the spec on the fly during reconciliation and adds the parallelism overrides.
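
      To illustrate the "changed on the fly" step, here is a minimal sketch of merging stored overrides into an in-memory copy of the configuration. It is not the operator's actual code: the class, method, and vertex names are hypothetical, and it assumes the overrides end up under Flink's pipeline.jobvertex-parallelism-overrides option as comma-separated vertex:parallelism pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class OverrideMergeSketch {
    // Assumed target option for per-vertex parallelism overrides.
    static final String OVERRIDES_KEY = "pipeline.jobvertex-parallelism-overrides";

    /**
     * Returns a copy of the user's Flink configuration with the stored
     * overrides added. The user's map itself is left untouched, which mirrors
     * the idea that only the in-memory spec changes during reconciliation.
     */
    static Map<String, String> applyOverrides(
            Map<String, String> userFlinkConf, Map<String, Integer> storedOverrides) {
        Map<String, String> effective = new LinkedHashMap<>(userFlinkConf);
        if (!storedOverrides.isEmpty()) {
            // Sort by vertex id so the encoded value is deterministic.
            String encoded = new TreeMap<>(storedOverrides).entrySet().stream()
                    .map(e -> e.getKey() + ":" + e.getValue())
                    .collect(Collectors.joining(","));
            effective.put(OVERRIDES_KEY, encoded);
        }
        return effective;
    }

    public static void main(String[] args) {
        // Hypothetical user config and hypothetical vertex ids.
        Map<String, String> userConf = Map.of("taskmanager.numberOfTaskSlots", "2");
        Map<String, Integer> overrides = Map.of("vertex-a", 4, "vertex-b", 8);

        Map<String, String> effective = applyOverrides(userConf, overrides);
        System.out.println(effective.get(OVERRIDES_KEY));
        System.out.println(userConf.containsKey(OVERRIDES_KEY));
    }
}
```

      Because the user-provided configuration is never modified in this model, nothing in the stored resource itself records that the effective spec changed.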

      Unfortunately, this leads to the cluster getting stuck with the job in FINISHED state after taking a savepoint for upgrade. The operator assumes that the new cluster was deployed successfully and goes into DEPLOYED state again.

      Log flow (from oldest to newest):

      1. Rescheduling new reconciliation immediately to execute scaling operation.
      2. Upgrading/Restarting running job, suspending first...
      3. Job is in running state, ready for upgrade with SAVEPOINT
      4. Suspending existing deployment.
      5. Suspending job with savepoint.
      6. Job successfully suspended with savepoint
      7. The resource is being upgraded
      8. Pending upgrade is already deployed, updating status.
      9. Observing JobManager deployment. Previous status: DEPLOYING
      10. JobManager deployment port is ready, waiting for the Flink REST API...
      11. DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not yet considered to be stable and might be rolled back in the future

      The issue appears to be in step (8): https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260 because a mere parallelism override change does not change the generation id, so the pending upgrade is presumably treated as already deployed.
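
      The suspected failure mode can be modeled with a small sketch. This is a simplified illustration, not the code at the linked line: all class, field, and method names are hypothetical, and it assumes the check compares only the generation of the running deployment with the generation of the upgrade target. The stricter variant is one conceivable way to tell the two apart, not necessarily the fix that was adopted.

```java
import java.util.Map;

public class GenerationCheckSketch {
    /** Hypothetical view of the upgrade target: generation plus in-memory overrides. */
    static final class TargetSpec {
        final long generation;
        final Map<String, Integer> parallelismOverrides;

        TargetSpec(long generation, Map<String, Integer> parallelismOverrides) {
            this.generation = generation;
            this.parallelismOverrides = parallelismOverrides;
        }
    }

    /** Hypothetical view of what is currently deployed on the cluster. */
    static final class Deployed {
        final long generation;
        final Map<String, Integer> parallelismOverrides;

        Deployed(long generation, Map<String, Integer> parallelismOverrides) {
            this.generation = generation;
            this.parallelismOverrides = parallelismOverrides;
        }
    }

    /** Assumed model of the check: matching generations mean "already upgraded". */
    static boolean alreadyDeployedByGeneration(Deployed deployed, TargetSpec target) {
        return deployed.generation == target.generation;
    }

    /** Illustrative stricter check that also compares the parallelism overrides. */
    static boolean alreadyDeployedStrict(Deployed deployed, TargetSpec target) {
        return deployed.generation == target.generation
                && deployed.parallelismOverrides.equals(target.parallelismOverrides);
    }

    public static void main(String[] args) {
        // The old deployment and the scaled target share generation 7,
        // because only the override changed, not the user-facing spec.
        Deployed old = new Deployed(7, Map.of("vertex-a", 2));
        TargetSpec scaled = new TargetSpec(7, Map.of("vertex-a", 4));

        System.out.println(alreadyDeployedByGeneration(old, scaled)); // true
        System.out.println(alreadyDeployedStrict(old, scaled));       // false
    }
}
```

      Under these assumptions, the generation-only check reports the old, suspended deployment as the upgraded one, which matches the "Pending upgrade is already deployed" message in step (8).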


          People

            gyfora Gyula Fora
            mxm Maximilian Michels