[YUNIKORN-677] Potential resource leak when complete and allocate pod happens simultaneously - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11
Component/s: None
Labels:
- pull-request-available

Description

Suppose an app has one pod that needs scheduling. The shim submits the app to the core and starts scheduling the pod. On the shim side, this is a task in the Scheduling state. A race occurs if the following two things happen simultaneously:

The user deletes the pod. This triggers a CompleteTask event on the shim side, and the shim sends a ReleaseAllocationAskRequest to the core.
Before handling the ReleaseAllocationAskRequest from the shim, the core makes an allocation for the given pod and sends an Allocation to the shim.

The sequence is then: the core generates an allocation on a node; the core receives the release request and deletes the pending ask; the shim receives the new allocation, but because the pod has already been deleted, the shim ignores it. The allocation is then left over, which causes the resource leak.
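
The interleaving described above can be replayed deterministically with a toy model. The following is only a minimal sketch: `Core`, `Shim`, `simulateRace`, and the `releaseOrphans` flag are hypothetical names invented for illustration, not YuniKorn types or APIs. The mitigation shown (the shim releasing an allocation it cannot match to a live task) is an assumption about one possible remedy, not necessarily what the linked pull request does.

```go
package main

import "fmt"

// Core is a toy stand-in for the scheduler core (hypothetical, not the YuniKorn API).
type Core struct {
	pendingAsks map[string]bool   // task IDs still waiting for an allocation
	allocations map[string]string // task ID -> node name
}

func NewCore() *Core {
	return &Core{pendingAsks: map[string]bool{}, allocations: map[string]string{}}
}

// AddAsk registers a pending ask for the task.
func (c *Core) AddAsk(taskID string) { c.pendingAsks[taskID] = true }

// Allocate turns a pending ask into an allocation on the given node.
func (c *Core) Allocate(taskID, node string) bool {
	if !c.pendingAsks[taskID] {
		return false
	}
	delete(c.pendingAsks, taskID)
	c.allocations[taskID] = node
	return true
}

// ReleaseAsk removes a pending ask only; it does not touch existing allocations.
func (c *Core) ReleaseAsk(taskID string) { delete(c.pendingAsks, taskID) }

// ReleaseAllocation frees an allocation held for the task.
func (c *Core) ReleaseAllocation(taskID string) { delete(c.allocations, taskID) }

// Shim is a toy stand-in for the Kubernetes shim (hypothetical).
type Shim struct {
	tasks          map[string]bool // tasks whose pods still exist
	core           *Core
	releaseOrphans bool // assumed mitigation: release allocations for unknown tasks
}

// HandleAllocation processes an Allocation sent by the core.
func (s *Shim) HandleAllocation(taskID string) {
	if s.tasks[taskID] {
		return // normal path: the pod would be bound here
	}
	if s.releaseOrphans {
		s.core.ReleaseAllocation(taskID) // hand the orphaned allocation back
	}
	// Without the mitigation the allocation is silently ignored.
}

// simulateRace replays the interleaving and returns the number of leaked allocations.
func simulateRace(releaseOrphans bool) int {
	core := NewCore()
	shim := &Shim{tasks: map[string]bool{"task-1": true}, core: core, releaseOrphans: releaseOrphans}
	core.AddAsk("task-1")

	delete(shim.tasks, "task-1")      // user deletes the pod; the release request is in flight
	core.Allocate("task-1", "node-1") // core allocates before it sees the release
	core.ReleaseAsk("task-1")         // release arrives: only the pending ask is affected
	shim.HandleAllocation("task-1")   // shim receives the allocation for a deleted pod

	return len(core.allocations)
}

func main() {
	fmt.Println("leaked allocations, allocation ignored:", simulateRace(false))
	fmt.Println("leaked allocations, orphan released:", simulateRace(true))
}
```

In this model the leak comes from the release request acting only on the pending ask: once the ask has already become an allocation, nothing on either side cleans it up unless one of them does so explicitly.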

Issue Links

blocks

YUNIKORN-686 [Regression] Placeholders for completed applications are recreated during recovery

Closed

breaks

YUNIKORN-741 Regression: occupied resources miscalculated sometimes for yunikorn pods

Closed

links to

GitHub Pull Request #265

People

Assignee: Weiwei Yang

Reporter: Weiwei Yang

Votes: 1

Watchers: 2

Dates

Created: 20/May/21 03:51

Updated: 21/Jan/22 21:48

Resolved: 21/May/21 06:41