Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.15.3, 1.16.1, 1.18.0, 1.17.1
Description
This build ran into a timeout. Based on the stacktraces reported, it was either caused by SnapshotMigrationTestBase.restoreAndExecute:
"main" #1 prio=5 os_prio=0 tid=0x00007f23d800b800 nid=0x60cdd waiting on condition [0x00007f23e1c0d000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.flink.test.checkpointing.utils.SnapshotMigrationTestBase.restoreAndExecute(SnapshotMigrationTestBase.java:382)
    at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSnapshot(TypeSerializerSnapshotMigrationITCase.java:172)
    at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
    [...]
or PartiallyFinishedSourcesITCase.test:
2023-02-20T07:13:05.6084711Z "main" #1 prio=5 os_prio=0 tid=0x00007fd35c00b800 nid=0x8c8a waiting on condition [0x00007fd363d0f000]
2023-02-20T07:13:05.6085149Z    java.lang.Thread.State: TIMED_WAITING (sleeping)
2023-02-20T07:13:05.6085487Z     at java.lang.Thread.sleep(Native Method)
2023-02-20T07:13:05.6085925Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
2023-02-20T07:13:05.6086512Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
2023-02-20T07:13:05.6087103Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
2023-02-20T07:13:05.6087730Z     at org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
2023-02-20T07:13:05.6088410Z     at org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
2023-02-20T07:13:05.6088957Z     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
Still, this seems odd: based on a code analysis, it is quite unlikely that either of these two tests caused the timeout. The former has a 5-minute timeout (see the related code at SnapshotMigrationTestBase.java:382). The latter was previously found not to be responsible in a similar case, where a different test running concurrently turned out to be the cause (see FLINK-30261).
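To illustrate why a thread dump that catches "main" inside Thread.sleep does not by itself point to the culprit, here is a minimal sketch of a deadline-bounded polling loop. It is not the Flink code: the class BoundedWait, the method waitUntil, and its parameters are hypothetical names chosen for illustration. A wait of this shape spends nearly all of its time sleeping, so it is very likely to show up in a dump, yet it gives up once its own deadline passes and cannot consume the whole build budget.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Flink: sketches a polling wait with a hard deadline.
public class BoundedWait {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false; // deadline reached: the wait ends here instead of hanging
            }
            try {
                // A thread dump taken now shows TIMED_WAITING (sleeping) for this thread.
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

With a 5-minute timeout, such a loop returns after at most roughly 5 minutes plus one poll interval, which is why the time has to have been lost elsewhere.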
Investigating where the time was lost showed that AdaptiveSchedulerITCase took 2,980 s (about 50 minutes) to finish (see the build logs):
2023-02-20T03:43:55.4546050Z Feb 20 03:43:55 [ERROR] Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
2023-02-20T03:43:58.0448506Z Feb 20 03:43:58 [INFO] Running org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
2023-02-20T04:33:38.6824634Z Feb 20 04:33:38 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2,980.445 s - in org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
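A search like the one that surfaced this line can be sketched as a small scan over the build log for test summary lines. The sketch below assumes the summary format seen in the log above ("Time elapsed: ... s - in <class>", with a comma as thousands separator). The class SlowTestScan and its methods are hypothetical names, not an existing tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts elapsed time and test class from a summary log line.
public class SlowTestScan {

    // Matches e.g. "Time elapsed: 2,980.445 s - in org.example.SomeITCase".
    // Assumes a comma thousands separator and a dot decimal separator, as in the log above.
    private static final Pattern SUMMARY =
            Pattern.compile("Time elapsed: ([0-9][0-9,]*(?:\\.[0-9]+)?) s - in (\\S+)");

    /** Returns the elapsed seconds from a summary line, or -1 if the line is not a summary. */
    static double elapsedSeconds(String logLine) {
        Matcher m = SUMMARY.matcher(logLine);
        if (!m.find()) {
            return -1;
        }
        return Double.parseDouble(m.group(1).replace(",", ""));
    }

    /** Returns the test class named in a summary line, or null if the line is not a summary. */
    static String testClass(String logLine) {
        Matcher m = SUMMARY.matcher(logLine);
        return m.find() ? m.group(2) : null;
    }
}
```

Applying elapsedSeconds to every line of a build log and keeping the lines above a chosen threshold (for example 600 s) would list the test classes that account for most of the wall-clock time.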