  1. Apache YuniKorn
  2. YUNIKORN-2295 [Umbrella] Stabilize E2E tests
  3. YUNIKORN-2294

Flaky E2E Test: "Verify_Hard_GS_Failed_State" polling short-lived "Failing" application status

Details

    Description

      The gang_scheduling E2E test “Verify_Hard_GS_Failed_State” failed in the following runs:

      1. https://github.com/apache/yunikorn-k8shim/actions/runs/7356744028/job/20027836104#step:6:972 (PR of YUNIKORN-2292)
      2. https://github.com/apache/yunikorn-k8shim/actions/runs/7308989229/job/19960722817?pr=753#step:6:971 (PR of YUNIKORN-2247)

      The E2E test waits until the application status turns to ‘Failing’ (gang_scheduling_test.go#L288). However, the application does not stay in "Failing" for long. Below are the durations measured in my local test runs.

      1. 0.565 seconds
      2. 0.519 seconds
      3. 0.634 seconds
      4. 0.604 seconds
      5. 0.573 seconds
      6. 0.586 seconds
      7. 0.587 seconds
      8. 0.640 seconds
      9. 0.779 seconds
      10. 0.584 seconds

      (Note: each duration is the time between the two failApplication events, "Accept->Failing" and "Failing -> Failed".)

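      checkAppStatus() polls every 300 ms, which is shorter than every duration measured above, so I cannot reproduce the issue in my local environment. However, there is no guarantee that the application stays in 'Failing' for longer than 300 ms. If it leaves the state sooner, the poll can miss 'Failing' entirely and the test times out.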

      (The dumped scheduler log of the E2E test is missing because of the issue described in YUNIKORN-2293: the E2E test did not call tests.LogYunikornContainer() in AfterEach. Once YUNIKORN-2293 is fixed, we will be able to check the failure log in GitHub Actions.)

            People

              Yu-Lin Chen