Apache YuniKorn / YUNIKORN-2790

GPU node restart could leave root queue always out of quota

Details

    Description

      On a node restart, the pods assigned to and running on the node are not checked against the quota of the queue(s) they run in. There are multiple reasons for this. Pods that were scheduled by YuniKorn and are already running on a node must not be rejected, because rejecting them could cause many side effects.

      The combination of a node restart and the reconfiguration of a GPU driver could, however, cause a secondary issue. On restart, the node might not expose the GPU resource yet. Pods that ran before the restart may already be using the GPU resource. After those pods are added, ignoring quotas, the root queue shows usage for a resource that has not been registered yet.

      This prevents all scheduling from progressing, even for pods that do not request the GPU resource. Each scheduling action checks the root queue quota and fails. As a result, the GPU driver pods cannot be placed and the node cannot register the GPU.
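
      The deadlock described above can be illustrated with a minimal sketch in Go. This is not YuniKorn's actual code: the `Resource` type, the `add`, `withinLimit` and `canSchedule` helpers, and the resource names and quantities (including `nvidia.com/gpu`) are hypothetical and chosen only for illustration. The sketch assumes that the root queue limit is built from the resources the nodes have registered, and that a resource type missing from the limit counts as zero.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource map (name -> quantity).
type Resource map[string]int64

// add returns a new Resource holding the per-type sum of a and b.
func add(a, b Resource) Resource {
	out := Resource{}
	for name, qty := range a {
		out[name] += qty
	}
	for name, qty := range b {
		out[name] += qty
	}
	return out
}

// withinLimit reports whether every resource type in used stays within limit.
// Assumption: a type that is missing from limit is treated as a limit of zero.
func withinLimit(used, limit Resource) bool {
	for name, qty := range used {
		if qty > limit[name] {
			return false
		}
	}
	return true
}

// canSchedule mimics a root queue quota check: the usage after adding the
// request must still fit within the limit.
func canSchedule(usage, request, limit Resource) bool {
	return withinLimit(add(usage, request), limit)
}

func main() {
	// After the restart the node has registered CPU and memory, but no GPU yet.
	limit := Resource{"vcore": 8000, "memory": 32000}
	// A pod that ran before the restart was re-added without a quota check.
	usage := Resource{"vcore": 2000, "memory": 4000, "nvidia.com/gpu": 1}
	// The GPU driver pod itself does not request any GPU.
	driverPod := Resource{"vcore": 100, "memory": 200}

	fmt.Println(canSchedule(usage, driverPod, limit)) // prints false

	// Once the GPU is registered, the same request passes the check.
	limit["nvidia.com/gpu"] = 2
	fmt.Println(canSchedule(usage, driverPod, limit)) // prints true
}
```

      In this sketch the first check fails even though the driver pod requests no GPU, because the pre-existing GPU usage already exceeds the (missing) GPU limit. The limit can only be corrected once the driver pod has run, which is the circular dependency the description reports.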

            People

              wilfreds Wilfred Spiegelenburg