Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- None
Description
On a node restart, the pods already assigned to and running on that node are not checked against the quota of the queue(s) they run in. There are several reasons for this. Pods on a node that were scheduled by YuniKorn and are already running must not be rejected, because rejecting them could cause many side effects.
However, a node restart combined with reconfiguring a GPU driver can cause a secondary issue. On restart, the node might not expose the GPU resource yet, while pods that ran before the restart may still be using it. After those pods are added back with quotas ignored, the root queue shows usage for a resource that has not been registered yet.
This blocks all scheduling, even for pods that do not request the GPU resource: every scheduling action checks the root queue quota and fails. As a result, the GPU driver pods cannot be placed, so the node never registers the GPU.
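The failure mode described above can be illustrated with a minimal Go sketch. It is not YuniKorn code: the `Resource` type and the `fitsIn` and `add` helpers are hypothetical simplifications, and the resource name `nvidia.com/gpu` and all quantities are example values. The sketch assumes that a resource type missing from the quota is treated as a limit of zero, which is what makes a restored GPU usage fail every later check.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource vector (name -> quantity).
type Resource map[string]int64

// fitsIn reports whether every quantity in used is within limit.
// Assumption: a resource type missing from limit counts as a limit of zero.
func fitsIn(limit, used Resource) bool {
	for name, qty := range used {
		if qty > limit[name] {
			return false
		}
	}
	return true
}

// add returns a new Resource holding the sum of a and b.
func add(a, b Resource) Resource {
	out := Resource{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

func main() {
	// Root quota built from the restarted node, which has not registered its GPU yet.
	rootMax := Resource{"cpu": 16, "memory": 64}
	// Usage restored from pods that ran before the restart; one of them holds a GPU.
	rootUsed := Resource{"cpu": 4, "memory": 8, "nvidia.com/gpu": 1}
	// A new pod that does not request a GPU at all.
	request := Resource{"cpu": 1, "memory": 1}

	// The GPU usage already exceeds the (absent) GPU limit, so the check fails
	// regardless of what the new pod asks for.
	fmt.Println(fitsIn(rootMax, add(rootUsed, request))) // prints false
}
```

In this sketch the request itself is well within the CPU and memory limits. The check fails only because the restored usage carries a resource type that the quota does not know about yet.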
Issue Links
- relates to: YUNIKORN-2794 Resource: Change SubOnlyExisting() to same signature as AddOnlyExisting() (Resolved)
- links to