Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.5.1
Fix Version/s: None
Component/s: None
Description
We had a broken node in the cluster: Kubernetes kept creating pods on it that immediately failed with the "OutOfGPU" state. The node accumulated 1000+ such pods.
The scheduler panicked (log attached) and stopped scheduling any other pods.
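For reference, a pod rejected this way never runs: the kubelet fails it at admission because the node's GPU capacity is already used up, and a replacement pod gets created. The sketch below shows roughly what such a pod status looks like; the values are illustrative, not taken from this cluster, and the exact reason/message strings depend on the Kubernetes version.

status:
  phase: Failed
  # Kubelet admission rejections use an OutOf<resource> reason; for the GPU
  # extended resource this surfaces as the "OutOfGPU"-style failures above.
  reason: OutOfnvidia.com/gpu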
The config:
apiVersion: v1
data:
  admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
  queues.yaml: |
    partitions:
      - name: default
        placementrules:
          - name: fixed
            value: root.scavenging.osg
            create: true
            filter:
              type: allow
              users:
                - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
                - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
                - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
          - name: tag
            value: namespace
            create: true
            parent:
              name: tag
              value: namespace.parentqueue
          - name: tag
            value: namespace
            create: true
            parent:
              name: fixed
              value: general
        nodesortpolicy:
          type: fair
          resourceweights:
            vcore: 1.0
            memory: 1.0
            nvidia.com/gpu: 4.0
        queues:
          - name: root
            submitacl: '*'
            properties:
              application.sort.policy: fair
            queues:
              - name: system
                parent: true
                properties:
                  preemption.policy: disabled
              - name: general
                parent: true
                childtemplate:
                  properties:
                    application.sort.policy: fair
                  resources:
                    guaranteed:
                      vcore: 100
                      memory: 1Ti
                      nvidia.com/gpu: 8
                    max:
                      vcore: 4000
                      memory: 15Ti
                      nvidia.com/gpu: 200
              - name: scavenging
                parent: true
                childtemplate:
                  resources:
                    guaranteed:
                      vcore: 1
                      memory: 1G
                      nvidia.com/gpu: 1
                  properties:
                    priority.offset: "-10"
              - name: interactive
                parent: true
                childtemplate:
                  resources:
                    guaranteed:
                      vcore: 1000
                      memory: 10T
                      nvidia.com/gpu: 48
                      nvidia.com/a100: 4
                  properties:
                    priority.offset: "10"
                    preemption.policy: disabled
              - name: clemson
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 256
                    memory: 2T
                    nvidia.com/gpu: 24
              - name: nysernet
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 1000
                    memory: 5T
                    nvidia.com/gpu: 16
              - name: gpn
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 5000
                    memory: 50T
                    nvidia.com/gpu: 256
                    nvidia.com/a100: 16
              - name: sdsu
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 1000
                    memory: 15T
                    nvidia.com/gpu: 112
                    nvidia.com/a100: 64
                queues:
                  - name: sdsu-jupyterhub
                    parent: false
                    properties:
                      preemption.policy: disabled
                      priority.offset: "10"
                    resources:
                      guaranteed:
                        vcore: 700
                        memory: 5T
                        nvidia.com/gpu: 100
              - name: tide
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 592
                    memory: 15T
                    nvidia.com/gpu: 72
                queues:
                  - name: rook-tide
                    parent: false
                    properties:
                      preemption.policy: disabled
                      priority.offset: "10"
                    resources:
                      guaranteed:
                        vcore: 500
                        memory: 1T
              - name: ucsc
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 500
                    memory: 4T
                    nvidia.com/gpu: 256
              - name: ucsd
                parent: true
                properties:
                  application.sort.policy: fair
                resources:
                  guaranteed:
                    vcore: 40000
                    memory: 40T
                    nvidia.com/gpu: 512
                    nvidia.com/a100: 100
                queues:
                  - name: ry
                    parent: true
                    properties:
                      application.sort.policy: fair
                    resources:
                      guaranteed:
                        vcore: 512
                        memory: 8T
                        nvidia.com/gpu: 144
                  - name: suncave
                    parent: false
                    properties:
                      preemption.policy: disabled
                      priority.offset: "10"
                    resources:
                      guaranteed:
                        vcore: 1000
                        memory: 1T
                  - name: dimm
                    parent: false
                    properties:
                      preemption.policy: disabled
                      priority.offset: "1000"
                    resources:
                      guaranteed:
                        vcore: 1000
                        memory: 1T
                  - name: haosu
                    parent: true
                    properties:
                      application.sort.policy: fair
                    resources:
                      guaranteed:
                        vcore: 5000
                        memory: 10T
                        nvidia.com/gpu: 120
                    queues:
                      - name: rook-haosu
                        parent: false
                        properties:
                          preemption.policy: disabled
                          priority.offset: "10"
                        resources:
                          guaranteed:
                            vcore: 1000
                            memory: 1T
kind: ConfigMap
metadata:
  creationTimestamp: "2023-12-21T06:09:12Z"
  name: yunikorn-configs
  namespace: yunikorn
  resourceVersion: "7764804169"
  uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
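To illustrate how this configuration routes work: the first (fixed) placement rule sends pods submitted by the listed OSG provisioner service accounts to the root.scavenging.osg leaf queue, while the tag rules place other pods into a queue derived from their namespace. A hypothetical pod matching the fixed rule would look roughly like the sketch below; all names, the image, and the resource request are assumptions for illustration, not taken from the cluster.

apiVersion: v1
kind: Pod
metadata:
  name: htcondor-worker-example        # hypothetical name
  namespace: osg-ligo
spec:
  serviceAccountName: prp-htcondor-provisioner
  schedulerName: yunikorn              # handled by the YuniKorn scheduler
  containers:
    - name: worker
      image: example.org/htcondor-worker:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # the extended resource the broken node kept rejecting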
Attachments
Issue Links
- fixes: YUNIKORN-2804 [Umbrella] Rethink general retry policy for post allocation failed task (Open)