[YUNIKORN-2180] Clean up scheduler state initialization - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: None
Labels:
- pull-request-available

Target Version: 1.5.0

Description

Scheduler state initialization (otherwise known as recovery) is currently fragile and somewhat unpredictable since multiple asynchronous processes coordinate to perform the various init tasks.

Startup initialization should be simplified to the following steps:

1. Read all priority classes from the informer and register them with the scheduler cache
2. Read all nodes from the informer and register them (in a drained state) with the scheduler core
3. Read all pods from the informer and register applications and allocations as necessary, associating existing allocations with nodes from step #2
4. Enable the nodes which were originally registered in step #2
5. Register and start Kubernetes event handlers
6. Re-read priority classes from the informer and remove any that have gone away since step #1, ensuring we don't miss priority class deletions during init
7. Re-read nodes from the informer and remove any that have gone away since step #2, ensuring we don't miss node deletions during init
8. Re-read pods from the informer and remove any that have gone away since step #3, ensuring we don't miss pod deletions during init

Additionally, this process should be handled entirely by the scheduler context to avoid multiple competing concerns.
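
The eight steps can be sketched as one sequential routine owned by a single context object. This is a minimal illustration, not the code from the linked pull request: `Informer`, `Context`, `InitializeState`, `prune`, and the `duringInit` hook are all hypothetical names, and the informers are modeled as plain maps rather than real Kubernetes informers. The hook exists only to simulate deletions that land after the initial reads but before the event handlers are registered.

```go
package main

import "fmt"

// Informer is a hypothetical stand-in for a Kubernetes informer cache.
// Keys are object names; for pods the value is the assigned node name.
type Informer map[string]string

// Context is a hypothetical scheduler context that owns every init step.
type Context struct {
	PriorityClasses map[string]bool
	Nodes           map[string]bool   // node name -> schedulable (false = drained)
	Allocations     map[string]string // pod name -> node name
	HandlersStarted bool
}

func NewContext() *Context {
	return &Context{
		PriorityClasses: map[string]bool{},
		Nodes:           map[string]bool{},
		Allocations:     map[string]string{},
	}
}

// prune removes every entry of known that the informer no longer lists.
func prune[V any](known map[string]V, inf Informer) {
	for name := range known {
		if _, ok := inf[name]; !ok {
			delete(known, name)
		}
	}
}

// InitializeState runs the init steps strictly in order on one goroutine.
// duringInit (may be nil) simulates cluster changes that occur before the
// event handlers are registered.
func (c *Context) InitializeState(pcs, nodes, pods Informer, duringInit func()) {
	// Step 1: register priority classes.
	for name := range pcs {
		c.PriorityClasses[name] = true
	}
	// Step 2: register nodes in a drained (unschedulable) state.
	for name := range nodes {
		c.Nodes[name] = false
	}
	// Step 3: register allocations, attaching each pod to a known node.
	for pod, node := range pods {
		if _, ok := c.Nodes[node]; ok {
			c.Allocations[pod] = node
		}
	}
	// Step 4: enable the nodes registered in step 2.
	for name := range c.Nodes {
		c.Nodes[name] = true
	}
	if duringInit != nil {
		duringInit()
	}
	// Step 5: register and start event handlers (modeled as a flag here).
	c.HandlersStarted = true
	// Steps 6-8: re-read each informer and drop anything deleted during init.
	prune(c.PriorityClasses, pcs)
	prune(c.Nodes, nodes)
	prune(c.Allocations, pods)
}

func main() {
	pcs := Informer{"high": "", "low": ""}
	nodes := Informer{"node-a": "", "node-b": ""}
	pods := Informer{"pod-1": "node-a", "pod-2": "node-b"}

	ctx := NewContext()
	ctx.InitializeState(pcs, nodes, pods, func() {
		// Deletions that would otherwise be missed during init.
		delete(pcs, "low")
		delete(nodes, "node-b")
		delete(pods, "pod-2")
	})
	fmt.Println(len(ctx.PriorityClasses), len(ctx.Nodes), len(ctx.Allocations))
}
```

Because every step runs in order inside one method, nothing asynchronous has to coordinate, and the final re-read passes cover the window in which a deletion could have been missed before the handlers started.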

Issue Links

fixes

- YUNIKORN-1169 Fix ApplicationMetadata restoration during recovery (Closed)
- YUNIKORN-1615 Node occupied resource is negative (Closed)
- YUNIKORN-936 app and node recovery event ordering (Closed)
- YUNIKORN-2189 Change the log level of "adding node to cache" from WARN to INFO (Closed)

links to

- GitHub Pull Request #734

People

Assignee: Craig Condit
Reporter: Craig Condit
Votes: 0
Watchers: 1

Dates

Created: 22/Nov/23 18:18
Updated: 20/Mar/24 14:32
Resolved: 04/Jan/24 18:12