Details
Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Description
If we create pods whose task-group-name annotation does not match the name of any task group defined in the task-groups annotation, the real pods never transition to the Running state after the placeholder pods expire, provided YuniKorn was restarted in the meantime.
For example, extend the sleep batch job like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-9
spec:
  completions: 5
  parallelism: 5
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-9"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: sleep-groupxxx
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "sleep-group",
              "minMember": 5,
              "minResource": {
                "cpu": "100m",
                "memory": "2000M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    ...
Submit the job, then restart YuniKorn once the placeholders are running.
The "batch-sleep-job-9-nnnnn" pods then stay in Pending, never transition to Running, and have to be terminated manually.
$ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
default   batch-sleep-job-9-hgxxl                         0/1   Pending   0   20m
default   batch-sleep-job-9-j6twt                         0/1   Pending   0   20m
default   batch-sleep-job-9-l4jhm                         0/1   Pending   0   20m
default   batch-sleep-job-9-swlm4                         0/1   Pending   0   20m
default   batch-sleep-job-9-z6wqx                         0/1   Pending   0   20m
default   yunikorn-admission-controller-78c775cfd9-6pp8d  1/1   Running   4   3d22h
default   yunikorn-scheduler-77dd7c665b-f8kkn             2/2   Running   0   18m
Note that without a YuniKorn restart, they are deallocated and removed properly.
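The mismatch in the example job can be spotted before submission with a small check. The following is only a sketch and not YuniKorn's own validation logic: the function name find_task_group_mismatch is hypothetical, and the annotations dictionary is assumed to be the pod template's annotations copied from the job above.

```python
import json

# Annotation keys as used in the job spec above.
TASK_GROUP_NAME = "yunikorn.apache.org/task-group-name"
TASK_GROUPS = "yunikorn.apache.org/task-groups"


def find_task_group_mismatch(annotations):
    """Return a message if the pod's task group name is not defined, else None.

    Hypothetical helper for illustration only.
    """
    name = annotations.get(TASK_GROUP_NAME)
    raw_groups = annotations.get(TASK_GROUPS)
    if name is None or raw_groups is None:
        # The pod does not carry both task group annotations; nothing to compare.
        return None
    # The task-groups annotation holds a JSON list of task group definitions.
    defined = [group["name"] for group in json.loads(raw_groups)]
    if name not in defined:
        return (
            f"task group {name!r} is not defined in {TASK_GROUPS}; "
            f"defined groups: {defined}"
        )
    return None


# Annotations from the job above: the pod refers to "sleep-groupxxx",
# but only "sleep-group" is defined.
annotations = {
    TASK_GROUP_NAME: "sleep-groupxxx",
    TASK_GROUPS: """[{
        "name": "sleep-group",
        "minMember": 5,
        "minResource": {"cpu": "100m", "memory": "2000M"},
        "nodeSelector": {},
        "tolerations": []
    }]""",
}

print(find_task_group_mismatch(annotations))
```

For the annotations above the function returns a message naming "sleep-groupxxx"; changing the task-group-name value to "sleep-group" makes it return None.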
Attachments
Issue Links
- causes
  - YUNIKORN-1182 Fix YUNIKORN-1161 and YUNIKORN-1155 properly (Closed)
  - YUNIKORN-1180 JSON parse error when creating placeholders (Closed)
- is related to
  - YUNIKORN-1169 Fix ApplicationMetadata restoration during recovery (Closed)
- links to