[YUNIKORN-2521] Scheduler deadlock - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 1.5.1, 1.6.0
Component/s: None
Labels: None
Environment:
Yunikorn: 1.5
AWS EKS: v1.28.6-eks-508b6b3

Target Version: 1.5.1, 1.6.0

Description

Discussion on Yunikorn slack: https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179

Occasionally, Yunikorn deadlocks and prevents any new pods from starting. All pods stay in Pending, and the Yunikorn scheduler logs contain no errors indicating a problem.

Additionally, all of the pods have the correct annotations and labels from the admission service, so they are at least being created in k8s correctly.

The issue was seen intermittently with Yunikorn 1.5 running on EKS version `v1.28.6-eks-508b6b3`.

In our case, we run about 25-50 nodes and 200-400 pods. Pods and nodes are added and removed frequently because we run ML workloads.

Attached is the goroutine dump. We were not able to get a statedump as the endpoint kept timing out.
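For anyone trying to capture the same diagnostics, the following is a minimal sketch of fetching each dump with a hard client-side timeout, so that a hanging state dump does not prevent the goroutine dump from being saved. The base URL (scheduler REST port forwarded to `localhost:9080`), the two endpoint paths, and the helper names `fetch_dump` and `collect` are assumptions for illustration and do not come from this report; adjust them to your deployment.

```python
import http.client
import os
import urllib.request
from typing import Dict, Optional

# Assumed values, not taken from the report: the scheduler REST port
# forwarded to localhost:9080, and these two diagnostic paths.
BASE_URL = "http://localhost:9080"
ENDPOINTS = {
    "goroutines.txt": "/debug/pprof/goroutine?debug=2",
    "statedump.json": "/ws/v1/fullstatedump",
}


def fetch_dump(url: str, timeout_s: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except (OSError, http.client.HTTPException):
        # URLError, socket timeouts and connection resets are all OSError.
        return None


def collect(base_url: str = BASE_URL, out_dir: str = ".",
            timeout_s: float = 10.0) -> Dict[str, bool]:
    """Fetch every endpoint independently; report which ones succeeded."""
    results = {}
    for filename, path in ENDPOINTS.items():
        body = fetch_dump(base_url + path, timeout_s)
        if body is not None:
            with open(os.path.join(out_dir, filename), "wb") as fh:
                fh.write(body)
        results[filename] = body is not None
    return results
```

Calling `collect()` returns a mapping such as `{"goroutines.txt": True, "statedump.json": False}` when only the goroutine endpoint answers in time, which matches the situation described above.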

You can work around it by restarting the Yunikorn scheduler pod. Sometimes you also have to delete any "Pending" pods that got stuck while the scheduler was deadlocked, so that the new scheduler pod picks them up.
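The workaround above could be scripted roughly as follows. This is only a sketch: the namespace `yunikorn`, the deployment name `yunikorn-scheduler`, the workload namespace `default`, and the helper names are assumptions, not details from this report, and `kubectl rollout restart` is just one way to replace the scheduler pod. Commands are printed rather than executed unless `dry_run=False` is passed.

```python
import subprocess
from typing import List


def recovery_commands(scheduler_ns: str = "yunikorn",
                      deployment: str = "yunikorn-scheduler",
                      workload_ns: str = "default",
                      delete_pending: bool = True) -> List[List[str]]:
    """Build the kubectl invocations for the restart workaround.

    All default names are hypothetical placeholders.
    """
    cmds = [
        # Replace the scheduler pod and wait for the new one to be ready.
        ["kubectl", "-n", scheduler_ns, "rollout", "restart",
         f"deployment/{deployment}"],
        ["kubectl", "-n", scheduler_ns, "rollout", "status",
         f"deployment/{deployment}"],
    ]
    if delete_pending:
        # Optional step: remove pods left Pending during the deadlock so
        # their controllers recreate them for the new scheduler pod.
        cmds.append(["kubectl", "-n", workload_ns, "delete", "pods",
                     "--field-selector=status.phase=Pending"])
    return cmds


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd in commands:
        line = " ".join(cmd)
        print(line)
        printed.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return printed
```

Note that deleting Pending pods only helps for pods owned by a controller (Deployment, Job, and so on) that will recreate them; standalone pods would have to be resubmitted by hand.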

Attachments

File | Uploaded | Size | Uploaded by
0001-YUNIKORN-2539-core.patch | 05/Apr/24 16:11 | 43 kB | Craig Condit
0002-YUNIKORN-2539-k8shim.patch | 05/Apr/24 16:11 | 31 kB | Craig Condit
4_4_goroutine-1.txt | 04/Apr/24 20:57 | 85 kB | Noah Yoshida
4_4_goroutine-2.txt | 04/Apr/24 20:57 | 85 kB | Noah Yoshida
4_4_goroutine-3.txt | 04/Apr/24 20:57 | 86 kB | Noah Yoshida
4_4_goroutine-4.txt | 04/Apr/24 20:57 | 86 kB | Noah Yoshida
4_4_goroutine-5-state-dump.txt | 04/Apr/24 20:57 | 88 kB | Noah Yoshida
4_4_profile001.png | 04/Apr/24 20:57 | 407 kB | Noah Yoshida
4_4_profile002.png | 04/Apr/24 20:57 | 508 kB | Noah Yoshida
4_4_profile003.png | 04/Apr/24 20:57 | 449 kB | Noah Yoshida
4_4_scheduler-logs.txt | 04/Apr/24 20:57 | 180 kB | Noah Yoshida
deadlock_2024-04-18.log | 18/Apr/24 15:40 | 11 kB | Xi Chen
goroutine-4-3.out | 03/Apr/24 21:38 | 1.16 MB | Shravan Achar
goroutine-4-3-1.out | 03/Apr/24 21:38 | 8 kB | Shravan Achar
goroutine-4-3-2.out | 03/Apr/24 21:38 | 8 kB | Shravan Achar
goroutine-4-3-3.out | 03/Apr/24 21:38 | 1.39 MB | Shravan Achar
goroutine-4-5.out | 05/Apr/24 20:44 | 991 kB | Shravan Achar
goroutine-dump.txt | 28/Mar/24 19:36 | 32 kB | Noah Yoshida
goroutine-while-blocking.out | 03/Apr/24 21:48 | 1.53 MB | Shravan Achar
goroutine-while-blocking-2.out | 03/Apr/24 21:48 | 8 kB | Shravan Achar
logs-potential-deadlock.txt | 05/Apr/24 17:57 | 14 kB | Shravan Achar
logs-potential-deadlock-2.txt | 05/Apr/24 17:57 | 14 kB | Shravan Achar
logs-splunk.txt | 05/Apr/24 22:35 | 47.77 MB | Shravan Achar
logs-splunk-ordered.txt | 05/Apr/24 22:41 | 15.92 MB | Shravan Achar
profile001-4-5.gif | 05/Apr/24 20:53 | 192 kB | Shravan Achar
profile012.gif | 03/Apr/24 21:38 | 184 kB | Shravan Achar
profile013.gif | 03/Apr/24 21:38 | 184 kB | Shravan Achar
running-logs.txt | 05/Apr/24 20:43 | 48.88 MB | Shravan Achar
running-logs-2.txt | 05/Apr/24 21:29 | 6.25 MB | Shravan Achar

Issue Links

is related to:
YUNIKORN-2629 Adding a node can result in a deadlock (Resolved)

relates to:
YUNIKORN-2539 Add optional deadlock detection (Resolved)
YUNIKORN-2544 [UMBRELLA] Fix Yunikorn potential locking issues (Resolved)

People

Assignee: Peter Bacsko
Reporter: Noah Yoshida
Votes: 0
Watchers: 8

Dates

Created: 28/Mar/24 19:38
Updated: 23/May/24 03:54
Resolved: 23/Apr/24 19:42