Details
-
Sub-task
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
1.19.0
-
None
Description
DRILL-7908 fixed distributed deadlocks in TestDrillbitResilience and added better timing for simulating the different Drill states. However, several tests still fail intermittently.
1. Sometimes the tests report a memory leak:
Error: Failures:
Error: org.apache.drill.exec.server.TestDrillbitResilience.cancelInMiddleOfFetchingResults
Error: Run 1: TestDrillbitResilience.cancelInMiddleOfFetchingResults:375 We are leaking 3000000 bytes ==> expected: <0> but was: <3000000>
But there is actually no memory leak. It looks like the test checks the allocated memory too early, when not all fragments are closed yet, so adding a timeout before the final countAllocatedMemory call fixes the issue.
Another cause of the test failures: the queries were not in the expected state before cancellation (for instance, in the STARTING state instead of RUNNING). Adding a timeout before starting the cancellation thread gives the query time to reach the state that the test case expects before it is cancelled.
I no longer see any test failures with NUM_RUNS = 1000 (@RepeatedTest) for the problematic test cases.
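Both changes in item 1 come down to waiting until a condition holds before the test proceeds. The sketch below illustrates that idea with a bounded polling loop rather than a fixed sleep; it is not the code from the actual patch. The class AwaitCondition, the await helper, and the AtomicLong that stands in for the value returned by countAllocatedMemory are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

class AwaitCondition {

  /**
   * Polls the condition until it holds or the timeout elapses.
   * Returns true only if the condition held before the deadline.
   */
  static boolean await(BooleanSupplier condition, long timeoutMs, long pollMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out while the condition was still false
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    // Hypothetical stand-in for the byte count reported by countAllocatedMemory.
    AtomicLong allocated = new AtomicLong(3_000_000L);

    // Simulates fragments that release their buffers shortly after the query ends.
    Thread closer = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      allocated.set(0L);
    });
    closer.start();

    // Wait for the buffers to be released before asserting on the leak count.
    boolean released = await(() -> allocated.get() == 0L, 2_000L, 10L);
    closer.join();
    System.out.println("released=" + released + ", allocated=" + allocated.get());
  }
}
```

The same helper could, under the same assumptions, wait for a query to reach RUNNING before the cancellation thread starts. Compared with a fixed sleep, polling returns as soon as the condition holds and still fails the test after the deadline if a real leak exists.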
2. The other failing test case is:
Error: Failures:
Error: TestDrillbitResilience.foreman_runTryEnd:289->testForeman:973->assertFailsWithException:960->assertFailsWithException:954 Query state should be FAILED (and not COMPLETED). ==> expected: <COMPLETED> but was: <FAILED>
It relates to DRILL-3167. The root cause is the following: in some cases the query completes before the run-try-end exception is injected and thrown in the Foreman. The COMPLETED state is acceptable in such cases.
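One way to express "either FAILED or COMPLETED is acceptable" in a test is to check the final state against a set of allowed states instead of a single expected value. The following is a minimal sketch under that assumption; the QueryState enum here is a simplified, hypothetical mirror and not Drill's actual enum, and TerminalStateCheck and isAcceptable are illustrative names.

```java
import java.util.EnumSet;
import java.util.Set;

class TerminalStateCheck {

  // Hypothetical, simplified set of query states for illustration only.
  enum QueryState { STARTING, RUNNING, COMPLETED, CANCELED, FAILED }

  // States tolerated when the injected exception may race with query completion.
  static final Set<QueryState> ACCEPTABLE =
      EnumSet.of(QueryState.FAILED, QueryState.COMPLETED);

  /** Returns true if the observed final state is one the test should tolerate. */
  static boolean isAcceptable(QueryState finalState) {
    return ACCEPTABLE.contains(finalState);
  }

  public static void main(String[] args) {
    for (QueryState state : QueryState.values()) {
      System.out.println(state + " acceptable: " + isAcceptable(state));
    }
  }
}
```

A check of this shape keeps the test strict about unexpected outcomes such as CANCELED while tolerating the case where the query finishes before the injection point is reached.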
Issue Links
- fixes
-
DRILL-3192 TestDrillbitResilience#cancelWhenQueryIdArrives hangs
- Resolved
-
DRILL-3052 canceling a fragment executor before it starts running will cause the Foreman to wait indefinitely for a terminal message from that fragment
- Resolved
-
DRILL-3167 When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client
- Resolved
-
DRILL-3193 TestDrillbitResilience#interruptingWhileFragmentIsBlockedInAcquiringSendingTicket hangs and fails
- Resolved
-
DRILL-3194 TestDrillbitResilience#memoryLeaksWhenFailed hangs
- Resolved
-
DRILL-6228 Random failures of TestDrillbitResilience tests
- Resolved
-
DRILL-3967 Broken Test: TestDrillbitResilience.cancelAfterEverythingIsCompleted()
- Closed
- is part of
-
DRILL-7973 Fix GitHub CI intermittent failures
- Resolved
- is related to
-
DRILL-3163 Fix hang/ leak issue exposed by TestDrillbitResilience#foreman_runTryEnd
- Closed