[YUNIKORN-1161] Pods not linked to placeholders are stuck in Running state if YK is restarted - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: shim - kubernetes
Labels:
- pull-request-available

Description

If we create pods whose task-group-name annotation does not match the name of any task group defined in the task-groups annotation, the real pods do not transition to the Running state after the placeholder pods expire, provided YuniKorn was restarted in the meantime.

For example, extend the sleep batch job like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-9
spec:
  completions: 5
  parallelism: 5
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-9"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: sleep-groupxxx
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "sleep-group",
              "minMember": 5,
              "minResource": {
                "cpu": "100m",
                "memory": "2000M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
...

Submit the job, then restart YuniKorn while the placeholders are running.
The "batch-sleep-job-9-nnnnn" pods then never transition to Running. They stay Pending and have to be terminated manually.

$ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
default                batch-sleep-job-9-hgxxl                          0/1     Pending     0          20m
default                batch-sleep-job-9-j6twt                          0/1     Pending     0          20m
default                batch-sleep-job-9-l4jhm                          0/1     Pending     0          20m
default                batch-sleep-job-9-swlm4                          0/1     Pending     0          20m
default                batch-sleep-job-9-z6wqx                          0/1     Pending     0          20m
default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     4          3d22h
default                yunikorn-scheduler-77dd7c665b-f8kkn              2/2     Running     0          18m

Note that without a YuniKorn restart, these pods are deallocated and removed properly.
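
The mismatch that triggers this issue can be detected before a job is submitted. The following is a minimal sketch, not YuniKorn code: the helper name unmatched_task_group is hypothetical, and it only assumes the two annotation keys shown in the job spec above. It returns the task-group-name value when no entry in task-groups defines that name.

```python
import json

# Annotation keys as they appear in the job spec above.
TASK_GROUP_NAME = "yunikorn.apache.org/task-group-name"
TASK_GROUPS = "yunikorn.apache.org/task-groups"


def unmatched_task_group(annotations):
    """Hypothetical helper: return the task-group-name value if it is not
    defined in the task-groups annotation, otherwise None."""
    name = annotations.get(TASK_GROUP_NAME)
    if name is None:
        return None  # pod does not reference a task group at all
    groups = json.loads(annotations.get(TASK_GROUPS, "[]"))
    defined = {group.get("name") for group in groups}
    return None if name in defined else name


# Annotations copied from the batch-sleep-job-9 example.
annotations = {
    TASK_GROUP_NAME: "sleep-groupxxx",
    TASK_GROUPS: json.dumps([{
        "name": "sleep-group",
        "minMember": 5,
        "minResource": {"cpu": "100m", "memory": "2000M"},
        "nodeSelector": {},
        "tolerations": [],
    }]),
}

mismatch = unmatched_task_group(annotations)
if mismatch is not None:
    print(f"task group {mismatch!r} is not defined in task-groups")
```

For the example job this reports "sleep-groupxxx", because the only defined group is "sleep-group".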

Attachments

- logs-from-yunikorn-scheduler-k8s-in-yunikorn-scheduler-after_restart_nomatchingtaskgroupname.txt (45 kB, 29/Mar/22 10:20, Peter Bacsko)
- logs-from-yunikorn-scheduler-k8s-in-yunikorn-scheduler-before_restart_nomatchingtaskgroupname.txt (36 kB, 29/Mar/22 10:20, Peter Bacsko)
- pods_nomatchingtaskgroupname.txt (5 kB, 29/Mar/22 10:20, Peter Bacsko)

Issue Links

causes:
- YUNIKORN-1182 Fix YUNIKORN-1161 and YUNIKORN-1155 properly (Closed)
- YUNIKORN-1180 JSON parse error when creating placeholders (Closed)

is related to:
- YUNIKORN-1169 Fix ApplicationMetadata restoration during recovery (Closed)

links to:
- GitHub Pull Request #403

People

Assignee: Peter Bacsko
Reporter: Peter Bacsko
Votes: 0
Watchers: 2

Dates

Created: 29/Mar/22 08:38
Updated: 04/May/22 23:52
Resolved: 04/Apr/22 17:11