Apache YuniKorn / YUNIKORN-2520

PVC errors in AssumePod() are not handled properly

Details

    Description

      When a volume operation fails in Context.AssumePod(), the allocation on the core side is not removed.

      Although the result of UpdateAllocation is checked, the error is only logged:

      if err := callback.UpdateAllocation(response); err != nil {
      	rmp.handleUpdateResponseError(rmID, err)
      }
      ...

      func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
      	log.Log(log.RMProxy).Error("failed to handle response",
      		zap.String("rmID", rmID),
      		zap.Error(err))
      }

      I suggest moving the volume-related code to Task.postTaskAllocated(). The task will then transition to the "Failed" state with the allocationID available, so we can release both the ask and the allocation:

      func (task *Task) releaseAllocation() {
      	...
      	var releaseRequest *si.AllocationRequest
      	s := TaskStates()
      	switch task.GetTaskState() {
      	case s.New, s.Pending, s.Scheduling, s.Rejected:
      		// release ask + allocation if possible
      		releaseRequest = common.CreateReleaseAskRequestForTask(
      			task.applicationID, task.taskID, task.application.partition)
      	default:
      		if task.allocationID == "" {
      			... log error ...
      			return
      		}
      		releaseRequest = common.CreateReleaseAllocationRequestForTask(
      			task.applicationID, task.taskID, task.allocationID, task.application.partition, task.terminationType)
      	}
      ...

       


            People

              pbacsko Peter Bacsko
              pbacsko Peter Bacsko