  1. Apache YuniKorn
  2. YUNIKORN-2629

Adding a node can result in a deadlock

Details

    Description

      Adding a new node after YuniKorn state initialization can result in a deadlock.

      The problem is that Context.addNode() holds a lock while we're waiting for the NodeAccepted event:

      	dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
      		nodeEvent, ok := event.(CachedSchedulerNodeEvent)
      		if !ok {
      			return
      		}
      		// [...] removed for clarity
      		wg.Done()
      	})
      	defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
      	if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
      		Nodes: nodesToRegister,
      		RmID:  schedulerconf.GetSchedulerConf().ClusterID,
      	}); err != nil {
      		log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
      		return nil, err
      	}

      	// wait for all responses to accumulate
      	wg.Wait() // <--- shim gets stuck here
       

      If tasks are being processed at the same time, the dispatcher will try to retrieve the event handler, which is returned from Context:

      	go func() {
      		for {
      			select {
      			case event := <-getDispatcher().eventChan:
      				switch v := event.(type) {
      				case events.TaskEvent:
      					getEventHandler(EventTypeTask)(v) // <--- eventually calls Context.getTask()
      				case events.ApplicationEvent:
      					getEventHandler(EventTypeApp)(v)
      				case events.SchedulerNodeEvent:
      					getEventHandler(EventTypeNode)(v)
      				// [...]
      

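      Since addNode() is holding a write lock, the task handler blocks as soon as it needs the Context lock, and the event processing loop gets stuck with it. The node event that would call wg.Done() is never dispatched, so registerNodes() will never progress.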
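
      The cycle described above can be reproduced in isolation. The following is a minimal sketch, not YuniKorn code: shimContext, event, simulateAddNode and the holdLock switch are hypothetical names made up for illustration, and a timeout stands in for the unbounded wg.Wait() so the program terminates. It assumes, as the description suggests, that the task handler needs the same lock that addNode() holds and that a single goroutine dispatches all events. Releasing the lock before waiting is shown only to contrast the two orderings; it is not necessarily the fix applied for this issue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// event is a hypothetical stand-in for the shim's dispatcher events.
type event struct{ kind string }

// shimContext is a toy stand-in for the shim Context: one RWMutex guards its state.
type shimContext struct {
	lock sync.RWMutex
}

// getTask mimics a task handler that needs the context's read lock.
func (c *shimContext) getTask() {
	c.lock.RLock()
	defer c.lock.RUnlock()
}

// simulateAddNode queues a task event followed by a node-accepted event and
// then waits for the node-accepted event. With holdLock set, the wait happens
// while the write lock is still held. It reports whether the node-accepted
// event arrived before the timeout.
func simulateAddNode(holdLock bool) bool {
	ctx := &shimContext{}
	events := make(chan event, 2)
	var wg sync.WaitGroup
	wg.Add(1)

	// Single dispatcher goroutine: a blocked handler stalls every later event.
	go func() {
		for ev := range events {
			switch ev.kind {
			case "task":
				ctx.getTask() // blocks while the write lock is held
			case "nodeAccepted":
				wg.Done()
			}
		}
	}()

	ctx.lock.Lock()
	events <- event{kind: "task"}
	events <- event{kind: "nodeAccepted"}
	if !holdLock {
		ctx.lock.Unlock() // release before waiting
	}

	// Turn wg.Wait() into a channel so the wait can time out.
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	accepted := false
	select {
	case <-done:
		accepted = true
	case <-time.After(200 * time.Millisecond):
		// The real shim has no timeout here and would wait forever.
	}

	if holdLock {
		ctx.lock.Unlock() // let the stuck dispatcher drain so nothing leaks
	}
	close(events)
	return accepted
}

func main() {
	fmt.Println("waiting while holding the lock, node accepted:", simulateAddNode(true))
	fmt.Println("waiting after releasing the lock, node accepted:", simulateAddNode(false))
}
```

      With holdLock set, the dispatcher goroutine parks inside getTask() and the node-accepted event stays queued behind it, which mirrors the stuck wg.Wait() in the shim. With the lock released first, both events are handled and the wait completes.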

      Attachments

        1. yunikorn_stuck_stack_20240708.txt (86 kB, Xi Chen)
        2. yunikorn-scheduler-20240627.log (9 kB, Xi Chen)
        3. updateNode_deadlock_trace.txt (4 kB, Wilfred Spiegelenburg)

            People

              pbacsko Peter Bacsko
              Votes: 1
              Watchers: 7
