[YUNIKORN-2790] GPU node restart could leave root queue always out of quota - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.6.0
Component/s: core - scheduler
Labels:
- pull-request-available
- release-notes

Target Version: 1.6.0

Description

On a node restart, the pods assigned to and running on the node are not checked against the quota of the queue(s) they run in. There are multiple reasons for this. Pods on a node that were scheduled by YuniKorn and are already running must not be rejected, since rejecting them could cause many side effects.

The combination of a node restart and the reconfiguration of a GPU driver can, however, cause a secondary issue. On restart the node might not yet expose the GPU resource, while pods that ran before the restart may still be using it. After those pods are added, ignoring quotas, the root queue shows usage for a resource that has not been registered yet.

This prevents all scheduling from progressing, even for pods that do not request the GPU resource: each scheduling action checks the root queue quota and fails. As a result the GPU driver pods cannot be placed, and the node never registers the GPU.
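
The deadlock described above can be illustrated with a minimal, self-contained sketch. It is not YuniKorn code: the `Resource` type, the `fitsIn`, `add` and `canSchedule` helpers, and the resource quantities are all hypothetical, and the sketch assumes that a resource type missing from the root queue limit counts as a limit of zero.

```go
package main

import "fmt"

// Resource is a hypothetical stand-in for a scheduler resource:
// resource name -> quantity. It is not the YuniKorn type.
type Resource map[string]int64

// fitsIn reports whether every quantity in used is within limit.
// Assumption: a resource type missing from limit has a limit of zero.
func fitsIn(limit, used Resource) bool {
	for name, qty := range used {
		if qty > limit[name] {
			return false
		}
	}
	return true
}

// add returns a new Resource holding the per-type sum of a and b.
func add(a, b Resource) Resource {
	out := Resource{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

// canSchedule models the root queue check made for each scheduling action:
// current usage plus the new request must fit in the root queue limit.
func canSchedule(rootMax, rootUsed, request Resource) bool {
	return fitsIn(rootMax, add(rootUsed, request))
}

func main() {
	// After the restart the node registers without the GPU resource,
	// so the root queue limit has no GPU entry.
	rootMax := Resource{"vcore": 16000, "memory": 64000}

	// Pods that ran before the restart are re-added without a quota check,
	// so the root queue usage already contains one GPU.
	rootUsed := Resource{"vcore": 4000, "memory": 8000, "nvidia.com/gpu": 1}

	// The GPU driver pod requests no GPU at all.
	driverPod := Resource{"vcore": 500, "memory": 512}

	// The check fails on the GPU entry in the usage, not on the request.
	fmt.Println(canSchedule(rootMax, rootUsed, driverPod)) // prints "false"
}
```

In this model the request is rejected because of usage already recorded in the root queue, which is why pods that never ask for a GPU, including the driver pods that would make the GPU visible, are blocked as well.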

Issue Links

relates to

YUNIKORN-2794 Resource: Change SubOnlyExisting() to same signature as AddOnlyExisting()

Resolved

links to

GitHub Pull Request #933

People

Assignee: Wilfred Spiegelenburg

Reporter: Wilfred Spiegelenburg

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/24 06:01

Updated: 08/Aug/24 14:08

Resolved: 08/Aug/24 14:08