[YUNIKORN-2731] YuniKorn stopped scheduling new containers with negative vcore in queue - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.1
Fix Version/s: 1.5.2
Component/s: core - scheduler
Labels: None

Description

We encounter this issue in one of our clusters every few days. We are running a build from the yunikorn-k8shim branch-1.5 (https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/), commit fb4e3f11345e6a9866dfaea97770c94b9421807b.

Here is our queues.yaml configuration:

partitions:
  - name: default
    nodesortpolicy:
      type: binpacking
    preemption:
      enabled: false
    placementrules:
      - name: tag
        value: namespace
        create: false
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: c
            resources:
              guaranteed:
                memory: 13000Gi
                vcore: 3250
              max:
                memory: 13000Gi
                vcore: 3250
            properties:
              application.sort.policy: fair
          - name: e
            resources:
              guaranteed:
                memory: 2600Gi
                vcore: 650
              max:
                memory: 2600Gi
                vcore: 650
            properties:
              application.sort.policy: fair
          - name: m1
            resources:
              guaranteed:
                memory: 1000Gi
                vcore: 250
              max:
                memory: 1000Gi
                vcore: 250
            properties:
              application.sort.policy: fair
          - name: m2
            resources:
              guaranteed:
                memory: 62000Gi
                vcore: 15500
              max:
                memory: 62000Gi
                vcore: 15500
            properties:
              application.sort.policy: fair

At some point the scheduler stops starting new containers. Eventually no containers are running at all, and many applications are stuck in the Accepted state.

Some log entries report a negative vcore resource, and their timing correlates closely with the issue:

2024-07-08T10:19:13.436Z    INFO    core.scheduler    scheduler/scheduler.go:101    Found outstanding requests that will trigger autoscaling    {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:13.604563       1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76  c    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:13.60205325 +0000 UTC m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod c example-job-1720433945-574-aa32179091daba13-driver a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:19:05.391Z    INFO    core.scheduler    scheduler/scheduler.go:101    Found outstanding requests that will trigger autoscaling    {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:05.601679       1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4  e    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:05.599216316 +0000 UTC m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod e example-job-1720433937-295-e7b2229091da99a7-driver 14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request '14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:18:51.325Z    INFO    core.scheduler    scheduler/scheduler.go:101    Found outstanding requests that will trigger autoscaling    {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:18:51.596390       1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5  m1    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:51.593930204 +0000 UTC m=+524756.649495472,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m1 example-job-1720433923-763-378d629091da6500-driver f0c19c6a-6eb5-4e68-808d-389862c197cb v1 201821358 },Related:nil,Note:Request 'f0c19c6a-6eb5-4e68-808d-389862c197cb' does not fit in queue 'root.m1' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"

2024-07-08T10:18:03.231Z    INFO    shim.context    cache/context.go:1139    app request originating pod added    {"appID": "spark-26e1b4f9c3124376aad12a9b63c8b711", "original task": "ffdf1356-4a7f-4559-9cbd-afa510f96cfe"}
E0708 10:18:03.584031       1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df  m2    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:03.581485338 +0000 UTC m=+524708.637050606,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m2 another-example-job-1720433872-277-38b8b09091d9a492-driver 9b99dd53-cd1d-48b4-a8e3-c0c58f98a503 v1 201820328 },Related:nil,Note:Request '9b99dd53-cd1d-48b4-a8e3-c0c58f98a503' does not fit in queue 'root.m2' (requested map[memory:3758096384 pods:1 vcore:500], available map[ephemeral-storage:2490103866211 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:352023250073 pods:1635 vcore:-87850]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
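
To find entries like these in a larger log, the "does not fit in queue" notes can be parsed and filtered for negative quantities. The following is only a sketch: the function names (parse_go_map, parse_note, negative_resources) are hypothetical helpers and not part of YuniKorn, and the regular expression assumes the note wording shown in the events above.

```python
import re

# Matches a Go-formatted map such as "map[memory:2147483648 pods:1 vcore:500]".
GO_MAP = re.compile(r"map\[([^\]]*)\]")

# Assumed note wording, copied from the rejected events in this report.
NOTE = re.compile(
    r"does not fit in queue '([^']+)' "
    r"\(requested (map\[[^\]]*\]), available (map\[[^\]]*\])\)"
)


def parse_go_map(text):
    """Parse 'map[k:v k:v ...]' into a dict of integer quantities."""
    match = GO_MAP.search(text)
    if match is None:
        return {}
    result = {}
    for pair in match.group(1).split():
        # Split on the last colon so names like 'hugepages-1Gi' stay intact.
        key, _, value = pair.rpartition(":")
        result[key] = int(value)
    return result


def parse_note(line):
    """Return (queue, requested, available) or None if the line has no note."""
    match = NOTE.search(line)
    if match is None:
        return None
    queue, requested, available = match.groups()
    return queue, parse_go_map(requested), parse_go_map(available)


def negative_resources(available):
    """Keep only the resource types whose available quantity is below zero."""
    return {name: qty for name, qty in available.items() if qty < 0}


if __name__ == "__main__":
    sample = (
        "Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in "
        "queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], "
        "available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 "
        "memory:605510809350 pods:1715 vcore:-56150])"
    )
    queue, requested, available = parse_note(sample)
    print(queue, negative_resources(available))
```

Running each log line through parse_note and keeping the ones where negative_resources is non-empty gives a timeline of when the negative values first appeared.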

There are also warnings that the scheduler is not healthy, but those warnings were already being logged before the issue started:

2024-07-08T10:19:24.990Z    WARN    core.scheduler.health    scheduler/health_checker.go:178    Scheduler is not healthy    {"name": "Consistency of data", "description": "Check if a partition's allocated resource <= total resource of the partition", "message": "Partitions with inconsistent data: [\"[foo-spark]default\"]"}
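
The check named in that warning (allocated resource <= total resource) and the "does not fit" messages above can be related with a small sketch. This is not YuniKorn's implementation: the helper names are hypothetical, the total and allocated vcore figures are invented for illustration, and treating a missing resource type as zero is an assumption. It only shows that once allocated exceeds total for a resource type, the remaining quantity is negative and no request asking for that type can fit.

```python
def subtract(total, allocated):
    """Per-resource difference total - allocated; missing types count as 0."""
    names = set(total) | set(allocated)
    return {n: total.get(n, 0) - allocated.get(n, 0) for n in names}


def is_consistent(total, allocated):
    """The health-check condition: allocated <= total for every resource type."""
    return all(qty >= 0 for qty in subtract(total, allocated).values())


def fits(requested, available):
    """A request fits only if every requested quantity is available."""
    return all(qty <= available.get(name, 0) for name, qty in requested.items())


if __name__ == "__main__":
    # Hypothetical totals, chosen so the difference matches the logged -56150.
    total = {"vcore": 100000, "pods": 2000}
    allocated = {"vcore": 156150, "pods": 285}
    available = subtract(total, allocated)

    # Request shape taken from the log entries in this report.
    request = {"memory": 2147483648, "pods": 1, "vcore": 500}
    available["memory"] = 605510809350

    print(is_consistent(total, allocated))  # allocated vcore exceeds total
    print(available["vcore"])
    print(fits(request, available))
```

Under these assumptions the request is rejected purely because of the negative vcore value, even though memory and pods are plentiful, which is the pattern seen in the rejected events.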
