[YUNIKORN-1996] Change a log about queue update failure due to max capacity reached from Warn to Debug - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: None
Fix Version/s: None
Component/s: core - scheduler
Labels:
- pull-request-available

Description

We are seeing an issue similar to ~~YUNIKORN-1985~~:

A flood of log entries (62k in 3 seconds for the same request) is produced once a queue has reached its max capacity:

       log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
         zap.Error(err))

 func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
    ...
    // everything OK really allocate
    alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
    if node.AddAllocation(alloc) {
       if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
          log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
             zap.Error(err))
          // revert the node update
          node.RemoveAllocation(alloc.GetUUID())
          return nil
       }

I strongly suspect this happens simply because YuniKorn tries many nodes again and again without being aware that the queue's max capacity has been reached. That work is unnecessary, because every attempt made while the queue is at max capacity is going to fail.

This would certainly impact YuniKorn's performance.

I guess we need to introduce categories of errors (MaxQueueCapReached, RequiredNodeUnavailable, etc.) that require a delay before retrying. The upper layers of the call stack would catch such an error, put the allocation into a queue or something similar, and wait for a certain period of time before retrying.

But as a first step, we can just change the log to Debug level. Since the UI provides a way to check how much resource a given queue is using, and whether it has reached its max capacity, we don't lose much diagnostic capability by changing the log to Debug.

Issue Links

links to

GitHub Pull Request #661

People

Assignee: Yongjun Zhang

Reporter: Yongjun Zhang

Votes: 0

Watchers: 3

Dates

Created: 20/Sep/23 18:55

Updated: 25/Sep/23 21:37

Resolved: 25/Sep/23 21:36