  1. Apache YuniKorn
  2. YUNIKORN-705 [Umbrella] Gang Scheduling stabilization
  3. YUNIKORN-1161

Pods not linked to placeholders are stuck in Pending state if YK is restarted


Details

    Description

      If we create pods whose task-group-name annotation does not match the name of any task group defined in the task-groups annotation, the real pods do not transition to the Running state after the placeholder pods expire if YuniKorn was restarted in the meantime.

      For example, extend the sleep batch job like this:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: batch-sleep-job-9
      spec:
        completions: 5
        parallelism: 5
        template:
          metadata:
            labels:
              app: sleep
              applicationId: "batch-sleep-job-9"
              queue: root.sandbox
            annotations:
              yunikorn.apache.org/task-group-name: sleep-groupxxx
              yunikorn.apache.org/task-groups: |-
                [{
                    "name": "sleep-group",
                    "minMember": 5,
                    "minResource": {
                      "cpu": "100m",
                      "memory": "2000M"
                    },
                    "nodeSelector": {},
                    "tolerations": []
                }]
      ...
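
The mismatch in the manifest above (task-group-name is "sleep-groupxxx", while the only declared group is "sleep-group") can be caught before the job is submitted. The following is a minimal sketch of such a check, not part of YuniKorn itself: the function name task_group_mismatch is hypothetical, and only the two annotation keys are taken from the example.

```python
import json

# Annotation keys as they appear in the job template above.
TASK_GROUP_NAME = "yunikorn.apache.org/task-group-name"
TASK_GROUPS = "yunikorn.apache.org/task-groups"


def task_group_mismatch(annotations):
    """Return the referenced task group name if it is not declared, else None.

    Hypothetical helper for illustration; `annotations` is the pod
    template's annotation mapping (str -> str).
    """
    referenced = annotations.get(TASK_GROUP_NAME)
    if referenced is None:
        # The pod does not reference any task group, so nothing to check.
        return None
    # The task-groups annotation holds a JSON array of group definitions.
    declared = {group["name"] for group in json.loads(annotations.get(TASK_GROUPS, "[]"))}
    return None if referenced in declared else referenced


# Annotations from the example job (nodeSelector/tolerations omitted).
annotations = {
    TASK_GROUP_NAME: "sleep-groupxxx",
    TASK_GROUPS: json.dumps([
        {
            "name": "sleep-group",
            "minMember": 5,
            "minResource": {"cpu": "100m", "memory": "2000M"},
        }
    ]),
}

print(task_group_mismatch(annotations))  # prints: sleep-groupxxx
```

A return value of None means the referenced name is declared; any other value is the name that no task group definition matches.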
      

      Submit the job and restart YuniKorn once the placeholders are running.
      The "batch-sleep-job-9-nnnnn" pods then never transition to Running and have to be terminated manually.

      $ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
      default                batch-sleep-job-9-hgxxl                          0/1     Pending     0          20m
      default                batch-sleep-job-9-j6twt                          0/1     Pending     0          20m
      default                batch-sleep-job-9-l4jhm                          0/1     Pending     0          20m
      default                batch-sleep-job-9-swlm4                          0/1     Pending     0          20m
      default                batch-sleep-job-9-z6wqx                          0/1     Pending     0          20m
      default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     4          3d22h
      default                yunikorn-scheduler-77dd7c665b-f8kkn              2/2     Running     0          18m
      

      Note that without a YK restart, the pods are deallocated and removed properly.
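
To confirm the symptom when reproducing, the pods left in Pending can be picked out of the `kubectl get pods -A` listing. This is only a sketch under the assumption that the output uses the column order shown above (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the helper name pending_pods is hypothetical.

```python
def pending_pods(kubectl_output, name_prefix):
    """Return names of pods with the given prefix whose STATUS is Pending.

    Assumes the whitespace-separated columns of `kubectl get pods -A`:
    NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            # Skip blank or truncated lines.
            continue
        name, status = fields[1], fields[3]
        if name.startswith(name_prefix) and status == "Pending":
            names.append(name)
    return names


# Sample taken from the listing in this report.
sample = """\
default   batch-sleep-job-9-hgxxl                          0/1   Pending   0   20m
default   batch-sleep-job-9-j6twt                          0/1   Pending   0   20m
default   batch-sleep-job-9-l4jhm                          0/1   Pending   0   20m
default   batch-sleep-job-9-swlm4                          0/1   Pending   0   20m
default   batch-sleep-job-9-z6wqx                          0/1   Pending   0   20m
default   yunikorn-admission-controller-78c775cfd9-6pp8d   1/1   Running   4   3d22h
default   yunikorn-scheduler-77dd7c665b-f8kkn              2/2   Running   0   18m
"""

for pod in pending_pods(sample, "batch-sleep-job-9-"):
    print(pod)
```

With the sample above, the five batch-sleep-job-9 pods are reported and the two YuniKorn pods are skipped.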

            People

              pbacsko Peter Bacsko