Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- 1.5.0
Description
Adding a new node after YuniKorn state initialization can result in a deadlock.
The problem is that Context.addNode() holds a lock while we're waiting for the NodeAccepted event:
    dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
        nodeEvent, ok := event.(CachedSchedulerNodeEvent)
        if !ok {
            return
        }
        [...] removed for clarity
        wg.Done()
    })
    defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
    if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
        Nodes: nodesToRegister,
        RmID:  schedulerconf.GetSchedulerConf().ClusterID,
    }); err != nil {
        log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
        return nil, err
    }
    // wait for all responses to accumulate
    wg.Wait() <--- shim gets stuck here
If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
    go func() {
        for {
            select {
            case event := <-getDispatcher().eventChan:
                switch v := event.(type) {
                case events.TaskEvent:
                    getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
                case events.ApplicationEvent:
                    getEventHandler(EventTypeApp)(v)
                case events.SchedulerNodeEvent:
                    getEventHandler(EventTypeNode)(v)
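Since addNode() is holding a write lock, the task handler blocks as soon as it needs the context lock, and the event processing loop gets stuck behind it. The NodeAccepted event is then never dispatched, wg.Done() is never called, and registerNodes() will never progress.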
Attachments
Issue Links
- causes
  - YUNIKORN-2910 Data corruption due to insufficient shim context locking (Resolved)
  - YUNIKORN-2668 Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails (Resolved)
- is related to
  - YUNIKORN-2630 Release context lock in shim when processing config in the core (Resolved)
- relates to
  - YUNIKORN-2521 Scheduler deadlock (Resolved)
- links to