[YUNIKORN-584] App recovery is skipped when applicationID is not set in pods' label - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10
Component/s: shim - kubernetes
Labels:
- pull-request-available

Description

There are cases where YK concludes that the cluster doesn't have enough resources even though it does. This has happened to me twice: after YK had been running in a cluster for a few days, the nodes endpoint suddenly showed only one node (the node that YK itself runs on), even though the K8s cluster has 10 nodes in total. If I then try to schedule a workload that requires more resources than are available on that node, YK leaves the pods pending with an event like the one below:

Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is pending for the requested resources become available

This happens because YK is not aware that the other nodes in the cluster have available resources.

All of this can be fixed by restarting YK (scaling the replicas down to 0 and then back up to 1), so a problem with the cache seems to be the cause, although the exact conditions that trigger this bug are not yet clear to me.

My environment is on AWS EKS with K8s 1.17, if that matters.
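
A minimal sketch for detecting the symptom described above: compare the node names that Kubernetes reports with the node names returned by YK's nodes endpoint. How the two lists are fetched is deliberately left out, and the function name and node names below are hypothetical, used for illustration only.

```python
def find_missing_nodes(k8s_nodes, scheduler_nodes):
    """Return the nodes known to Kubernetes but absent from the scheduler's view."""
    known = set(scheduler_nodes)
    return sorted(name for name in k8s_nodes if name not in known)


# Hypothetical node names, standing in for real cluster data.
k8s_nodes = [f"node-{i}" for i in range(1, 11)]  # the 10 nodes in the K8s cluster
scheduler_nodes = ["node-1"]                     # the single node YK reported

missing = find_missing_nodes(k8s_nodes, scheduler_nodes)
print(f"{len(missing)} node(s) missing from the scheduler's view")
```

A non-empty result would suggest that the scheduler's cache has drifted from the actual cluster state, which matches the situation in which restarting YK helped.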

Issue Links

is related to:
- YUNIKORN-593 Optimize the UT for ListApplications (Closed)

links to:
- GitHub Pull Request #246
- GitHub Pull Request #247

People

Assignee: Weiwei Yang
Reporter: Chaoran Yu
Votes: 0
Watchers: 3

Dates

Created: 18/Mar/21 12:33
Updated: 21/Jan/22 21:48
Resolved: 22/Mar/21 16:10