Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
Description
When a node is removed, the partition resources, and thus the root queue's max resources, are decreased. Node removal locks the partition, removes the node, and releases the partition lock before proceeding; cleanup of the node's allocations happens only after that. This means that for a short period the root queue's max resources are already decreased while its usage is not.
A scheduling cycle could be running during the node removal. The queue headroom calculation takes the difference between a queue's max resources and its usage, traversing the whole queue hierarchy.
If the headroom is limited by the root queue then we could have a race between the removal of the node allocations and scheduling:
- scheduling starts and queue headroom is calculated
- node is removed, queue max is lowered
- scheduling finds new allocation
- new allocation gets added to the queue updating usage
- root queue is over its max already or would go over max: scheduling fails
- node allocation removal proceeds and corrects the queue usage
Issue Links
- is caused by: YUNIKORN-551 "node removal races for lock during scheduling" (Closed)