Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
None
Description
In e2e test for "user_group_limit", we updated Yunikorn ConfigMap before submit sleep pods. https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70
However, the AdmissionControllerConf in Admission controller doesn't immediately reflect the changes. The time gap leads following error in CI flow: https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346
We need to find a way to ensure AdmissionControllerConf has been updated before e2e test submits a new pods.
This issue can be reproduced if we introduce 1 second delay when the admission controller updates the AdmissionControllerConf. Here is the example code.
func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) { cm := utils.Convert2ConfigMap(newObj) // sleep 1 to delay AdmissionControllerConf update time.Sleep(1 * time.Second) if idx, ok := h.configMapIndex(cm); ok { h.conf.configUpdated(idx, cm) } }
https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238
Below are the before/after AM logs when we added 1 sec delay:
Success without delay 1 sec:
- (AM configMapUpdateHandler) 05:36:39.184Z. AdmissionControllerConf trying to upgrade config
- (AM configMapUpdateHandler) 05:36:39.185Z . AdmissionControllerConf config upgraded
- (AM Webhook) 05:36:39.218Z. AM receive AdmissionReview request
- (AM Webhook) 05:36:39.221Z. AM check Pods with new config, E2E Test Passed
Failed with delay 1 sec:
- (AM configMapUpdateHandler) 08:19:31.025Z. AdmissionControllerConf trying to upgrade config
- (AM Webhook) 08:19:31.067Z. AM receive AdmissionReview request
- (AM Webhook) 08:19:31.069Z AM check Pods with stale config, E2E Test failed
- (AM configMapUpdateHandler) 08:19:32.026Z. AdmissionControllerConf config upgraded
In my kind cluster(v1.24.15), there is only 0.036 sec gap (05:36:39.185Z~ 05:36:39.221Z). It's possible that the admission controller in CI flow runs those steps in different order.
The possible Solution:
1. In the e2e test, sleep for 1 seconds between updating the configmap and submitting new sleep pods. (It's quick fix, I assumed the time gap is always less than 1 sec )
2. In the e2e test, check current AdmissionControllerConf's value before submit a new pod. (How do client dump current AdmissionControllerConf? Need to seek for more advice.)
Since this issue only impact e2e test for now, we can go with solution #1 as a quick fix. But it'll be better if we allow client to check whether AdmissionControllerConf is up-to-date.
Please kindly let me know if I have any misunderstandings.
Attachments
Issue Links
- links to