Details
Bug
Status: Closed
Major
Resolution: Fixed
1.5.1
None
Description
We have encountered this issue in one of our clusters every few days. We are running a version built from branch https://github.com/apache/yunikorn-k8shim/commits/branch-1.5/ at commit fb4e3f11345e6a9866dfaea97770c94b9421807b.
Here is our queues.yaml configuration:
partitions:
  - name: default
    nodesortpolicy:
      type: binpacking
    preemption:
      enabled: false
    placementrules:
      - name: tag
        value: namespace
        create: false
    queues:
      - name: root
        submitacl: '*'
        queues:
          - name: c
            resources:
              guaranteed:
                memory: 13000Gi
                vcore: 3250
              max:
                memory: 13000Gi
                vcore: 3250
            properties:
              application.sort.policy: fair
          - name: e
            resources:
              guaranteed:
                memory: 2600Gi
                vcore: 650
              max:
                memory: 2600Gi
                vcore: 650
            properties:
              application.sort.policy: fair
          - name: m1
            resources:
              guaranteed:
                memory: 1000Gi
                vcore: 250
              max:
                memory: 1000Gi
                vcore: 250
            properties:
              application.sort.policy: fair
          - name: m2
            resources:
              guaranteed:
                memory: 62000Gi
                vcore: 15500
              max:
                memory: 62000Gi
                vcore: 15500
            properties:
              application.sort.policy: fair
At some point the scheduler stops starting new containers. Eventually no containers are running and many applications are stuck in the Accepted state.
Some log entries report a negative vcore value in the available resources, and these entries correlate closely in time with the issue:
2024-07-08T10:19:13.436Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:13.604563 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433945-574-aa32179091daba13-driver.17e03590c7f8bd76 c 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:13.60205325 +0000 UTC m=+524778.657618517,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod c example-job-1720433945-574-aa32179091daba13-driver a1901455-3a23-4dbc-bbb2-9dd2dfc775ea v1 201821875 },Related:nil,Note:Request 'a1901455-3a23-4dbc-bbb2-9dd2dfc775ea' does not fit in queue 'root.c' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
2024-07-08T10:19:05.391Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:19:05.601679 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433937-295-e7b2229091da99a7-driver.17e0358eeaf728e4 e 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:19:05.599216316 +0000 UTC m=+524770.654781585,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod e example-job-1720433937-295-e7b2229091da99a7-driver 14a40bac-5e89-4293-bcb7-936c544694a2 v1 201821666 },Related:nil,Note:Request '14a40bac-5e89-4293-bcb7-936c544694a2' does not fit in queue 'root.e' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
2024-07-08T10:18:51.325Z INFO core.scheduler scheduler/scheduler.go:101 Found outstanding requests that will trigger autoscaling {"number of requests": 1, "total resources": "map[memory:2147483648 pods:1 vcore:500]"}
E0708 10:18:51.596390 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{example-job-1720433923-763-378d629091da6500-driver.17e0358ba82f71a5 m1 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:51.593930204 +0000 UTC m=+524756.649495472,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m1 example-job-1720433923-763-378d629091da6500-driver f0c19c6a-6eb5-4e68-808d-389862c197cb v1 201821358 },Related:nil,Note:Request 'f0c19c6a-6eb5-4e68-808d-389862c197cb' does not fit in queue 'root.m1' (requested map[memory:2147483648 pods:1 vcore:500], available map[ephemeral-storage:2689798906768 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:605510809350 pods:1715 vcore:-56150]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
2024-07-08T10:18:03.231Z INFO shim.context cache/context.go:1139 app request originating pod added {"appID": "spark-26e1b4f9c3124376aad12a9b63c8b711", "original task": "ffdf1356-4a7f-4559-9cbd-afa510f96cfe"}
E0708 10:18:03.584031 1 event_broadcaster.go:270] "Server rejected event (will not retry!)" err="Event \"another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df\" is invalid: [action: Required value, reason: Required value]" event="&Event{ObjectMeta:{another-example-job-1720433872-277-38b8b09091d9a492-driver.17e035807a6bb0df m2 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},EventTime:2024-07-08 10:18:03.581485338 +0000 UTC m=+524708.637050606,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-84cb695b5d-lr42h,Action:,Reason:,Regarding:{Pod m2 another-example-job-1720433872-277-38b8b09091d9a492-driver 9b99dd53-cd1d-48b4-a8e3-c0c58f98a503 v1 201820328 },Related:nil,Note:Request '9b99dd53-cd1d-48b4-a8e3-c0c58f98a503' does not fit in queue 'root.m2' (requested map[memory:3758096384 pods:1 vcore:500], available map[ephemeral-storage:2490103866211 hugepages-1Gi:0 hugepages-2Mi:0 hugepages-32Mi:0 hugepages-64Ki:0 memory:352023250073 pods:1635 vcore:-87850]),Type:Normal,DeprecatedSource:{ },DeprecatedFirstTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 +0000 UTC,DeprecatedCount:0,}"
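To illustrate why a negative available vcore value would block every request, here is a minimal Go sketch of a "does the request fit" comparison using the numbers from the 'root.c' event above. The `Resource` type and `fitsIn` function are hypothetical names for illustration and are not the actual yunikorn-core implementation; treating a missing resource type as zero is also an assumption.

```go
package main

import "fmt"

// Resource is a simplified, hypothetical stand-in for a scheduler
// resource map (resource name -> quantity).
type Resource map[string]int64

// fitsIn reports whether every quantity in request is covered by available.
// Assumption: a resource type missing from available counts as zero.
func fitsIn(available, request Resource) bool {
	for name, want := range request {
		if available[name] < want {
			return false
		}
	}
	return true
}

func main() {
	// Values copied from the 'root.c' event in the log.
	available := Resource{"memory": 605510809350, "pods": 1715, "vcore": -56150}
	request := Resource{"memory": 2147483648, "pods": 1, "vcore": 500}

	// Memory and pods are sufficient, but vcore is negative,
	// so no request asking for any vcore can ever fit.
	fmt.Println(fitsIn(available, request)) // prints false
}
```

Under this kind of check, once the available vcore goes negative, any request with a positive vcore ask is rejected regardless of how much memory or how many pods remain, which would be consistent with applications staying in the Accepted state.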
There are also some "Scheduler is not healthy" warnings, but those were already being logged before the issue started:
2024-07-08T10:19:24.990Z WARN core.scheduler.health scheduler/health_checker.go:178 Scheduler is not healthy {"name": "Consistency of data", "description": "Check if a partition's allocated resource <= total resource of the partition", "message": "Partitions with inconsistent data: [\"[foo-spark]default\"]"}
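The health check description in this warning ("allocated resource <= total resource of the partition") can be sketched as a per-resource comparison. This is only an illustration of the stated invariant: `Resource` and `overAllocated` are hypothetical names, the numbers in `main` are made up for the example, and the real check in scheduler/health_checker.go may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a simplified, hypothetical resource map (name -> quantity).
type Resource map[string]int64

// overAllocated returns the sorted names of resource types for which the
// allocated quantity exceeds the partition total.
// Assumption: a type missing from total counts as zero.
func overAllocated(allocated, total Resource) []string {
	var names []string
	for name, used := range allocated {
		if used > total[name] {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative numbers only, not taken from the cluster in this report.
	total := Resource{"memory": 1000, "vcore": 100}
	allocated := Resource{"memory": 800, "vcore": 150}

	// A non-empty result means the partition data is inconsistent.
	fmt.Println(overAllocated(allocated, total)) // prints [vcore]
}
```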