Details
- Sub-task
- Status: Closed
- Major
- Resolution: Fixed
- None
Description
The gang_scheduling e2e test “Verify_Hard_GS_Failed_State” failed in the following runs:
- https://github.com/apache/yunikorn-k8shim/actions/runs/7356744028/job/20027836104#step:6:972 (PR of YUNIKORN-2292)
- https://github.com/apache/yunikorn-k8shim/actions/runs/7308989229/job/19960722817?pr=753#step:6:971 (PR of YUNIKORN-2247)
The e2e test waits until the application status turns into 'Failing' (gang_scheduling_test.go#L288). However, the application does not stay in 'Failing' for long. Below are the durations measured in my local tests:
- 0.565 seconds
- 0.519 seconds
- 0.634 seconds
- 0.604 seconds
- 0.573 seconds
- 0.586 seconds
- 0.587 seconds
- 0.640 seconds
- 0.779 seconds
- 0.584 seconds
(Note: each duration is the time between the two failApplication events, "Accept -> Failing" and "Failing -> Failed".)
The polling interval of checkAppStatus() is 300 ms, so the issue cannot be reproduced in my local environment, where the shortest 'Failing' period was 519 ms. However, there is no guarantee that the application will stay in 'Failing' longer than 300 ms.
(The dumped scheduler log of the e2e test is missing because of the issue described in YUNIKORN-2293: the e2e test did not call tests.LogYunikornContainer() in AfterEach. Once YUNIKORN-2293 is fixed, we will be able to check the log of the failed run in GitHub Actions.)
Issue Links
- Blocked: YUNIKORN-2291 Flaky e2e test (Closed)