Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
When a volume operation in Context.AssumePod() fails, the allocation on the core side is not removed.
Although we check the result from UpdateAllocation, the error handling is just logging:
    if err := callback.UpdateAllocation(response); err != nil {
        rmp.handleUpdateResponseError(rmID, err)
    }
    ...
    func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
        log.Log(log.RMProxy).Error("failed to handle response",
            zap.String("rmID", rmID),
            zap.Error(err))
    }
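To make the leak concrete, here is a minimal, self-contained sketch of the pattern. It is not YuniKorn code: the `core` type, `allocate`, `logOnly` and `logAndRelease` are all hypothetical names invented for illustration, and the sketch assumes the core records the allocation before it notifies the shim.

```go
package main

import (
	"errors"
	"log"
)

// core is a hypothetical stand-in for the allocation bookkeeping on the core side.
type core struct {
	allocations map[string]bool // keyed by allocation ID
}

func newCore() *core {
	return &core{allocations: make(map[string]bool)}
}

// errVolume simulates a volume operation failing on the shim side.
var errVolume = errors.New("volume operation failed")

// allocate records the allocation first, then notifies the shim through
// callback. Any callback error is passed to onError.
func (c *core) allocate(id string, callback func(string) error, onError func(*core, string, error)) {
	c.allocations[id] = true
	if err := callback(id); err != nil {
		onError(c, id, err)
	}
}

// logOnly mirrors the behaviour described in this issue: the error is
// logged and nothing is cleaned up.
func logOnly(c *core, id string, err error) {
	log.Printf("failed to handle response for %s: %v", id, err)
}

// logAndRelease is a hypothetical handler that also removes the allocation.
func logAndRelease(c *core, id string, err error) {
	log.Printf("failed to handle response for %s: %v", id, err)
	delete(c.allocations, id)
}
```

With `logOnly`, a failing callback leaves the entry in `allocations`; with `logAndRelease` it is removed. In the real system the release would have to go through a release request rather than a direct delete.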
I suggest moving the volume-related code to Task.postTaskAllocated(). In that case, the task will transition to the "Failed" state and the allocationID will be available, so we can release both the ask and the allocation:
    func (task *Task) releaseAllocation() {
        ...
        var releaseRequest *si.AllocationRequest
        s := TaskStates()
        switch task.GetTaskState() {
        case s.New, s.Pending, s.Scheduling, s.Rejected:
            releaseRequest = common.CreateReleaseAskRequestForTask(
                task.applicationID, task.taskID, task.application.partition) // <-- release ask + allocation if possible
        default:
            if task.allocationID == "" {
                ... log error ...
                return
            }
            releaseRequest = common.CreateReleaseAllocationRequestForTask(
                task.applicationID, task.taskID, task.allocationID,
                task.application.partition, task.terminationType)
        }
        ...
    }
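The branch selection above can be sketched in isolation. This is a simplified model under stated assumptions, not the shim's actual types: `taskState`, `releaseKind` and `chooseRelease` are hypothetical names, and only the states mentioned in this issue are modelled.

```go
package main

import "fmt"

// taskState is a hypothetical simplification of the task state machine.
type taskState string

const (
	stateNew        taskState = "New"
	statePending    taskState = "Pending"
	stateScheduling taskState = "Scheduling"
	stateRejected   taskState = "Rejected"
	stateFailed     taskState = "Failed"
)

// releaseKind names which release request would be built.
type releaseKind string

const (
	releaseAsk        releaseKind = "release-ask"
	releaseAllocation releaseKind = "release-allocation"
)

// chooseRelease follows the switch in releaseAllocation(): early states
// release the ask, every other state needs a non-empty allocationID to
// release the allocation.
func chooseRelease(state taskState, allocationID string) (releaseKind, error) {
	switch state {
	case stateNew, statePending, stateScheduling, stateRejected:
		return releaseAsk, nil
	default:
		if allocationID == "" {
			return "", fmt.Errorf("task in state %s has no allocationID", state)
		}
		return releaseAllocation, nil
	}
}
```

Under this model, a task that fails after allocation (state "Failed" with a known allocationID) takes the default branch and releases the allocation. That is the outcome the proposed move to Task.postTaskAllocated() is meant to achieve.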