Uploaded image for project: 'Apache YuniKorn'
  1. Apache YuniKorn
  2. YUNIKORN-1998

Stale AdmissionControllerConf was used in e2e test

    XMLWordPrintableJSON

Details

    Description

      In e2e test for "user_group_limit", we updated Yunikorn ConfigMap before submit sleep pods.  https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70

      However, the AdmissionControllerConf in Admission controller doesn't immediately reflect the changes. The time gap leads following error in CI flow:  https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346

      We need to find a way to ensure AdmissionControllerConf has been updated before e2e test submits a new pods.

      This issue can be reproduced if we introduce 1 second delay when the admission controller updates the AdmissionControllerConf. Here is the example code.

      func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
          cm := utils.Convert2ConfigMap(newObj)
          
          // sleep 1 to delay AdmissionControllerConf update
          time.Sleep(1 * time.Second)
          
          if idx, ok := h.configMapIndex(cm); ok {
              h.conf.configUpdated(idx, cm)
          }
      } 

      https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238

      Below are the before/after AM logs when we added 1 sec delay:

      Success without delay 1 sec:

      • (AM configMapUpdateHandler)             05:36:39.184Z.  AdmissionControllerConf trying to upgrade config
      • (AM configMapUpdateHandler)             05:36:39.185Z . AdmissionControllerConf config upgraded
      • (AM Webhook)                                       05:36:39.218Z.  AM receive AdmissionReview request
      • (AM Webhook)                                       05:36:39.221Z.  AM check Pods with new config,  E2E Test Passed 

      Failed with delay 1 sec:

      • (AM configMapUpdateHandler)            08:19:31.025Z.   AdmissionControllerConf trying to upgrade config
      • (AM Webhook)                                      08:19:31.067Z.   AM receive AdmissionReview request
      • (AM Webhook)                                      08:19:31.069Z    AM check Pods with stale config, E2E Test failed 
      • (AM configMapUpdateHandler)            08:19:32.026Z.  AdmissionControllerConf config upgraded

      In my kind cluster(v1.24.15), there is only 0.036 sec gap (05:36:39.185Z~ 05:36:39.221Z).  It's possible that the admission controller in CI flow runs those steps in different order.

      The possible Solution:
      1. In the e2e test, sleep for 1 seconds between updating the configmap and submitting new sleep pods. (It's quick fix, I assumed the time gap is always less than 1 sec ) 
      2. In the e2e test, check current AdmissionControllerConf's value before submit a new pod. (How do client dump current AdmissionControllerConf? Need to seek for more advice.)

      Since this issue only impact e2e test for now, we can go with solution #1 as a quick fix.  But it'll be better if we allow client to check whether AdmissionControllerConf is up-to-date.

      Please kindly let me know if I have any misunderstandings. 

      Attachments

        Issue Links

          Activity

            People

              Yu-Lin Chen Yu-Lin Chen
              Yu-Lin Chen Yu-Lin Chen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: