[YUNIKORN-2645] Rate limit pod allocations on nodes - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.5.1
Fix Version/s: None
Component/s: core - scheduler
Labels: None
Target Version: 1.7.0

Description

We had a broken node in the cluster: Kubernetes kept creating pods on it, and each one failed immediately with the "OutOfGPU" status. The node had 1000+ pods on it.

The scheduler panicked (log attached) and stopped scheduling any other pods.

The config:

apiVersion: v1
data: 
  admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
  queues.yaml: |
    partitions: 
      - name: default
        placementrules: 
          - name: fixed
            value: root.scavenging.osg
            create: true
            filter: 
              type: allow
              users: 
              - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
              - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
              - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
          - name: tag
            value: namespace
            create: true
            parent: 
              name: tag
              value: namespace.parentqueue
          - name: tag
            value: namespace
            create: true
            parent: 
              name: fixed
              value: general
        nodesortpolicy: 
          type: fair
          resourceweights: 
            vcore: 1.0
            memory: 1.0
            nvidia.com/gpu: 4.0
        queues: 
          - name: root
            submitacl: '*'
            properties: 
              application.sort.policy: fair
            queues: 
            - name: system
              parent: true
              properties: 
                preemption.policy: disabled
            - name: general
              parent: true
              childtemplate: 
                properties: 
                  application.sort.policy: fair
                resources: 
                  guaranteed: 
                    vcore: 100
                    memory: 1Ti
                    nvidia.com/gpu: 8
                  max: 
                    vcore: 4000
                    memory: 15Ti
                    nvidia.com/gpu: 200
            - name: scavenging
              parent: true
              childtemplate: 
                resources: 
                  guaranteed: 
                    vcore: 1
                    memory: 1G
                    nvidia.com/gpu: 1
                properties: 
                  priority.offset: "-10"
            - name: interactive
              parent: true
              childtemplate: 
                resources: 
                  guaranteed: 
                    vcore: 1000
                    memory: 10T
                    nvidia.com/gpu: 48
                    nvidia.com/a100: 4
                properties: 
                  priority.offset: "10"
                  preemption.policy: disabled
            - name: clemson
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 256
                  memory: 2T
                  nvidia.com/gpu: 24
            - name: nysernet
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 1000
                  memory: 5T
                  nvidia.com/gpu: 16
            - name: gpn
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 5000
                  memory: 50T
                  nvidia.com/gpu: 256
                  nvidia.com/a100: 16
            - name: sdsu
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 1000
                  memory: 15T
                  nvidia.com/gpu: 112
                  nvidia.com/a100: 64
              queues: 
              - name: sdsu-jupyterhub
                parent: false
                properties: 
                  preemption.policy: disabled
                  priority.offset: "10"
                resources: 
                  guaranteed: 
                    vcore: 700
                    memory: 5T
                    nvidia.com/gpu: 100
            - name: tide
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 592
                  memory: 15T
                  nvidia.com/gpu: 72
              queues: 
              - name: rook-tide
                parent: false
                properties: 
                  preemption.policy: disabled
                  priority.offset: "10"
                resources: 
                  guaranteed: 
                    vcore: 500
                    memory: 1T
            - name: ucsc
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 500
                  memory: 4T
                  nvidia.com/gpu: 256
            - name: ucsd
              parent: true
              properties: 
                application.sort.policy: fair
              resources: 
                guaranteed: 
                  vcore: 40000
                  memory: 40T
                  nvidia.com/gpu: 512
                  nvidia.com/a100: 100
              queues: 
              - name: ry
                parent: true
                properties: 
                  application.sort.policy: fair
                resources: 
                  guaranteed: 
                    vcore: 512
                    memory: 8T
                    nvidia.com/gpu: 144
              - name: suncave
                parent: false
                properties: 
                  preemption.policy: disabled
                  priority.offset: "10"
                resources: 
                  guaranteed: 
                    vcore: 1000
                    memory: 1T
              - name: dimm
                parent: false
                properties: 
                  preemption.policy: disabled
                  priority.offset: "1000"
                resources: 
                  guaranteed: 
                    vcore: 1000
                    memory: 1T
              - name: haosu
                parent: true
                properties: 
                  application.sort.policy: fair
                resources: 
                  guaranteed: 
                    vcore: 5000
                    memory: 10T
                    nvidia.com/gpu: 120
                queues: 
                - name: rook-haosu
                  parent: false
                  properties: 
                    preemption.policy: disabled
                    priority.offset: "10"
                  resources: 
                    guaranteed: 
                      vcore: 1000
                      memory: 1T
kind: ConfigMap
metadata: 
  creationTimestamp: "2023-12-21T06:09:12Z"
  name: yunikorn-configs
  namespace: yunikorn
  resourceVersion: "7764804169"
  uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c

Attachments

yunikorn-logs.txt.gz (103 kB), added 28/May/24 19:12 by Dmitry

Issue Links

fixes: YUNIKORN-2804 [Umbrella] Rethink general retry policy for post allocation failed task (Open)
