FLINK-32334

Operator failed to create taskmanager deployment because it already exists


Details

    Description

      During a job upgrade, the operator failed to start the new job because it could not create the taskmanager deployment:

       

      Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: Could not create Kubernetes cluster \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could not create Kubernetes cluster \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure executing: POST at: https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=apps, kind=deployments, name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={})."}]} 

      As the error log indicates, this happens because the taskmanager deployment still exists while it is being deleted.
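To make the failure mode concrete, here is a minimal sketch of how a caller might tell this transient conflict apart from a genuine "already exists" error. The class and method names are hypothetical and not part of the operator. Matching on the message prefix is an assumption based only on the log line quoted above, so it is likely brittle across Kubernetes versions.

```java
/** Hypothetical classifier for the 409 response quoted in the error log. */
final class CreateConflicts {

    private CreateConflicts() {}

    /** Prefix of the API server message seen in the log (assumed to be stable). */
    static final String BEING_DELETED_PREFIX = "object is being deleted:";

    /**
     * Returns true when a failed create looks like a conflict with an object
     * that still exists only because its deletion has not finished yet.
     */
    static boolean isPendingDeletionConflict(int statusCode, String message) {
        return statusCode == 409
                && message != null
                && message.startsWith(BEING_DELETED_PREFIX);
    }
}
```

A caller could treat a `true` result as retryable and any other 409 as a real naming conflict.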

      Looking at the source code, we do rely on the FOREGROUND deletion propagation policy by default.

      Still, it seems that the REST API delete call only waits until the resource has been modified and the deletionTimestamp has been added to its metadata, not until the object is actually gone. See fabric8io/kubernetes-client issue #3246, "ensure delete returns only when the delete operation is fully finished".

      So we could hit this issue whenever the k8s cluster is slow to "really" delete the deployment.
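One possible mitigation is to poll until the old deployment is really gone before creating the new one. The sketch below is not the operator's actual fix. `DeletionWaiter`, `awaitGone`, and the `stillExists` callback are hypothetical names, and the existence check is left abstract so the example stays self-contained.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: waits until a resource is gone, not merely marked for deletion. */
final class DeletionWaiter {

    private DeletionWaiter() {}

    /**
     * Polls stillExists until it reports false or the timeout elapses.
     *
     * @return true if the resource disappeared in time, false on timeout or interruption
     */
    static boolean awaitGone(BooleanSupplier stillExists, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (stillExists.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still present after the timeout
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

In the operator, `stillExists` could plausibly wrap a fabric8 lookup such as `client.apps().deployments().inNamespace(ns).withName(name).get() != null`, called after the delete request and before the new cluster is deployed. That wiring is an assumption and is not taken from the operator source.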

       


          People

            nfraison.datadog Nicolas Fraison