Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.5.1
- Environment: EKS 1.29
Description
Please see the attached queue configuration (jira-queues.yaml).
I will create 100 pods in Tier0, 100 pods in Tier1, 100 pods in Tier2, and 100 pods in Tier3. Each Pod requires 1 vCore. Initially there are 0 suitable nodes to run the Pods, so all of them are Pending. Karpenter soon provisions nodes and Yunikorn reacts by binding the Pods.
Given this configuration, I would expect Yunikorn to distribute the allocations such that each of the tiered queues first reaches its guarantees. Instead, I observed a roughly even distribution of allocations across all of the queues.
Tier0 fails to meet its guarantees while Tier3, for instance, dramatically overshoots them.
> kubectl get pods -n finance | grep tier-0 | grep Pending | wc -l
86
> kubectl get pods -n finance | grep tier-1 | grep Pending | wc -l
83
> kubectl get pods -n finance | grep tier-2 | grep Pending | wc -l
78
> kubectl get pods -n finance | grep tier-3 | grep Pending | wc -l
77
Please see the attached screenshots for queue usage.
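The same queue usage can also be inspected without screenshots via the YuniKorn REST API. A minimal sketch, assuming the scheduler's REST endpoint is port-forwarded to localhost:9080 and the partition is named "default" (both are assumptions, as are the exact DAO field names, which can vary between versions):

import json
import urllib.request

# Assumptions (not from the report): the YuniKorn REST API is reachable on
# localhost:9080 and the partition is named "default"; adjust as needed.
URL = "http://localhost:9080/ws/v1/partition/default/queues"


def walk(queue, depth=0):
    # Print guaranteed vs. allocated resources for each queue in the hierarchy.
    # Field names may differ slightly between YuniKorn versions.
    name = queue.get("queuename", "?")
    print("  " * depth + f"{name}: guaranteed={queue.get('guaranteedResource')} "
          f"allocated={queue.get('allocatedResource')}")
    for child in queue.get("children") or []:
        walk(child, depth + 1)


with urllib.request.urlopen(URL) as resp:
    data = json.loads(resp.read())

# Some versions return the root queue object, others a list of queues.
for root in (data if isinstance(data, list) else [data]):
    walk(root)

Comparing guaranteed against allocated per tier queue shows the same roughly even split as the screenshots.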
Note that this situation can also be reproduced without Karpenter by setting Yunikorn's `service.schedulingInterval` to a long duration, say 1m. Doing so forces Yunikorn to react to 400 Pods across 4 queues at roughly the same time, forcing it to prioritize queue allocations.
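As a sketch of that alternative reproduction (assuming a default Helm install where the setting lives in the yunikorn-configs ConfigMap in the yunikorn namespace; both names are assumptions, adjust to your deployment), the interval can be raised with the same kubernetes client used by the test script:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Assumptions (not from the report): YuniKorn settings live in the
# "yunikorn-configs" ConfigMap in the "yunikorn" namespace.
patch = {"data": {"service.schedulingInterval": "1m"}}
v1.patch_namespaced_config_map(name="yunikorn-configs", namespace="yunikorn", body=patch)
print("service.schedulingInterval set to 1m")

Create the 400 Pods within that one-minute window so the scheduler sees all of them in a single scheduling cycle.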
Test code to generate Pods:
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()


def create_pod_manifest(tier, exec):
    # Build a 1-vCore Pod that targets the root.tiers.<tier> queue via labels
    # and is scheduled by YuniKorn onto the dedicated Spark nodes.
    pod_manifest = {
        'apiVersion': 'v1',
        'kind': 'Pod',
        'metadata': {
            'name': f"rolling-test-tier-{tier}-exec-{exec}",
            'namespace': 'finance',
            'labels': {
                'applicationId': f"MyOwnApplicationId-tier-{tier}",
                'queue': f"root.tiers.{tier}"
            },
            'annotations': {
                # YuniKorn reads the submitting user and groups from this annotation.
                "yunikorn.apache.org/user.info": '{"user":"system:serviceaccount:finance:spark","groups":["system:serviceaccounts","system:serviceaccounts:finance","system:authenticated"]}'
            }
        },
        'spec': {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "di.rbx.com/dedicated",
                                        "operator": "In",
                                        "values": ["spark"]
                                    }
                                ]
                            }
                        ]
                    }
                },
            },
            "tolerations": [
                {
                    "effect": "NoSchedule",
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "spark"
                },
            ],
            "schedulerName": "yunikorn",
            'restartPolicy': 'Always',
            'containers': [{
                "name": "ubuntu",
                'image': 'ubuntu',
                "command": ["sleep", "604800"],
                "imagePullPolicy": "IfNotPresent",
                "resources": {
                    "limits": {'cpu': "1"},
                    "requests": {'cpu': "1"}
                }
            }]
        }
    }
    return pod_manifest


# Create 100 Pods in each of the four tier queues (400 Pods total).
for i in range(0, 4):
    tier = str(i)
    for j in range(0, 100):
        exec = str(j)
        pod_manifest = create_pod_manifest(tier, exec)
        print(pod_manifest)
        api_response = v1.create_namespaced_pod(body=pod_manifest, namespace="finance")
        print(f"creating tier( {tier} ) exec( {exec} )")
Attachments
Issue Links
- is a parent of: YUNIKORN-2840 sortQueues: fair max performance and correctness change (Open)
- relates to: YUNIKORN-2789 Queue internalGetMax should use permissive calculator (Resolved)