[YUNIKORN-1988] Preemption happens when a queue is lower than its guaranteed capacity

Details

Type: Bug
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core - scheduler
Labels:
- pull-request-available

Description

Background:
We set tier-based priorityClass values and run YuniKorn 1.3 with the admission controller in production (our prod cluster has hundreds of EKS nodes).
Many production tier2 jobs were preempted unexpectedly. The application logs show that the driver pods were all shut down at around the same time.

Most of the failed jobs came from the same queue. The preempted queue has 300G of guaranteed memory, and each driver pod requests 24G of memory.
For now we have disabled the preemption feature in production to mitigate the issue.

Investigation:

Reproduced the issue in the dev environment: preemption can happen when a queue's usage is lower than its guaranteed capacity.
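To make the expected behaviour concrete, here is a minimal sketch in Go of the check we assumed would protect the queue. It is not YuniKorn's actual code: the `Queue` type, its fields, and `CanLoseVictim` are hypothetical names made up for illustration. The 300G guarantee and 24G driver size come from the report above; the 240G usage figure is an assumed example value.

```go
package main

import "fmt"

// Queue is a hypothetical, simplified model of a scheduler queue.
// It is not a YuniKorn type; memory is tracked in whole G for readability.
type Queue struct {
	Name        string
	GuaranteedG int64 // guaranteed memory configured for the queue
	UsedG       int64 // memory currently allocated to pods in the queue
}

// CanLoseVictim reports whether removing a pod of size victimG would still
// leave the queue at or above its guaranteed memory. Under the behaviour we
// expected, a pod should only be a preemption victim when this is true.
func (q Queue) CanLoseVictim(victimG int64) bool {
	return q.UsedG-victimG >= q.GuaranteedG
}

func main() {
	// 300G guaranteed and a 24G driver are from the report;
	// 240G of current usage is an assumed value below the guarantee.
	q := Queue{Name: "tier2", GuaranteedG: 300, UsedG: 240}
	fmt.Println(q.CanLoseVictim(24)) // prints "false"
}
```

With usage already below the guarantee, the check returns false, so under this assumption none of the queue's driver pods should have been selected as victims. The bug is that they were preempted anyway.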

Confirmed from the YuniKorn k8shim log that our driver pods were set as originators.

I am investigating how to fix the issue.

Issue Links

is related to: YUNIKORN-1990 Intra-queue preemption happens (Closed)

links to: GitHub Pull Request #660

People

Assignee: Rainie Li

Reporter: Rainie Li

Votes: 0

Watchers: 4

Dates

Created: 14/Sep/23 21:32

Updated: 01/Nov/23 23:04