
Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.5.0
    • None

    Description

      Scheduler state initialization (also known as recovery) is currently fragile and somewhat unpredictable because multiple asynchronous processes must coordinate to perform the various initialization tasks.

      Startup initialization should be simplified to the following steps:

      1. Read all priority classes from the informer and register them with the scheduler cache
      2. Read all nodes from the informer and register them (in a drained state) with the scheduler core
      3. Read all pods from the informer and register applications and allocations as necessary, associating existing allocations with nodes from step #2
      4. Enable the nodes which were originally registered in step #2
      5. Register and start Kubernetes event handlers
      6. Re-read priority classes from the informer and remove any that have gone away since step #1, ensuring we don't miss priority class deletions during init
      7. Re-read nodes from the informer and remove any that have gone away since step #2, ensuring we don't miss node deletions during init
      8. Re-read pods from the informer and remove any that have gone away since step #3, ensuring we don't miss pod deletions during init

      Additionally, this process should be handled entirely by the scheduler context to avoid multiple competing concerns.
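
      The eight startup steps listed above can be illustrated with a small, self-contained Go sketch. It is not the scheduler's actual code: the `lister`, `state`, `removeGone`, and `initialize` names are hypothetical, informers are modeled as plain functions returning object names, and applications and allocations are reduced to a set of pod names. The point is only to show the ordering: snapshot and register, enable nodes, start event handlers, then re-read and prune anything that disappeared in the meantime.

```go
package main

import "sort"

// lister is a hypothetical stand-in for an informer's lister: it returns the
// names of the objects of one kind that currently exist.
type lister func() []string

// state is a hypothetical, heavily simplified model of the scheduler's view.
type state struct {
	priorityClasses map[string]bool
	nodes           map[string]bool // value is true once the node is schedulable
	pods            map[string]bool
}

func newState() *state {
	return &state{
		priorityClasses: map[string]bool{},
		nodes:           map[string]bool{},
		pods:            map[string]bool{},
	}
}

// removeGone re-reads the lister and deletes from known every name that was in
// the original snapshot but is no longer listed. It returns the removed names,
// sorted, so callers can log them.
func removeGone(snapshot []string, list lister, known map[string]bool) []string {
	current := make(map[string]bool)
	for _, name := range list() {
		current[name] = true
	}
	var gone []string
	for _, name := range snapshot {
		if !current[name] {
			delete(known, name) // deleting a missing key is a no-op
			gone = append(gone, name)
		}
	}
	sort.Strings(gone)
	return gone
}

// initialize runs the steps in order on a single goroutine. startHandlers is a
// hypothetical callback that registers and starts the event handlers.
func initialize(s *state, pcs, nodes, pods lister, startHandlers func()) {
	// Steps 1-3: snapshot each kind and register it.
	pcSnap := pcs()
	for _, name := range pcSnap {
		s.priorityClasses[name] = true
	}
	nodeSnap := nodes()
	for _, name := range nodeSnap {
		s.nodes[name] = false // registered in a drained state
	}
	podSnap := pods()
	for _, name := range podSnap {
		s.pods[name] = true
	}

	// Step 4: enable the nodes registered in step 2.
	for _, name := range nodeSnap {
		s.nodes[name] = true
	}

	// Step 5: from here on, live events are delivered by the handlers.
	startHandlers()

	// Steps 6-8: catch deletions that happened before the handlers started.
	removeGone(pcSnap, pcs, s.priorityClasses)
	removeGone(nodeSnap, nodes, s.nodes)
	removeGone(podSnap, pods, s.pods)
}
```

      Because each snapshot is kept until the matching re-read, an object deleted at any point between its initial read and handler registration is still detected and removed, which is the gap steps 6 through 8 are meant to close.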

          People

            ccondit Craig Condit