Flink / FLINK-32334

Operator failed to create taskmanager deployment because it already exists

Details

    Description

      During a job upgrade, the operator failed to start the new job because it could not create the taskmanager deployment:

       

      Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: Could not create Kubernetes cluster \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could not create Kubernetes cluster \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure executing: POST at: https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=apps, kind=deployments, name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=object is being deleted: deployments.apps \"flink-metering-taskmanager\" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={})."}]} 

      As the error log indicates, this happens because the taskmanager deployment still exists while it is being deleted.

      Looking at the source code, we do rely on the FOREGROUND deletion policy by default.

      Still, it seems that the REST API call to delete only waits until the resource has been modified and the deletionTimestamp has been added to its metadata; see "ensure delete returns only when the delete operation is fully finished" (Issue #3246, fabric8io/kubernetes-client).

      So we could hit this issue whenever the k8s cluster is slow to actually delete the deployment.
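
      One possible mitigation, sketched below, is to poll until the old deployment is really gone before creating the new one. This is only an illustration and not the operator's actual code: the class DeletionWaiter, the method waitUntilGone and its parameters are hypothetical names, and the existence check is passed in as a plain BooleanSupplier so the sketch has no dependency on the Kubernetes client.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of the Flink operator or of fabric8. */
final class DeletionWaiter {

    private DeletionWaiter() {}

    /**
     * Polls stillExists until it reports false or the timeout elapses.
     *
     * @return true if the resource disappeared in time, false on timeout
     *         or if the waiting thread was interrupted
     */
    static boolean waitUntilGone(
            BooleanSupplier stillExists, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (stillExists.getAsBoolean()) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // resource is still there: give up
            }
            // Sleep for one poll interval, but never past the deadline.
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
            try {
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

      In the operator, the supplier could presumably wrap a lookup of the deployment through the fabric8 client (for example, checking whether get() on the "flink-metering-taskmanager" deployment in the "metering" namespace still returns a non-null object), and the new cluster would only be submitted once waitUntilGone returns true. The timeout and poll interval are assumptions that would need tuning for slow clusters.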

       

            People

              nfraison.datadog Nicolas Fraison