Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.14.0
Description
- We experienced a failure of OperatorCoordinatorSchedulerTest in our VVP fork of Flink. The finegrained_resource_management test run failed with a non-zero exit code:
Nov 01 17:19:12 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project flink-runtime: There are test failures.
Nov 01 17:19:12 [ERROR]
Nov 01 17:19:12 [ERROR] Please refer to /__w/1/s/flink-runtime/target/surefire-reports for the individual test results.
Nov 01 17:19:12 [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
Nov 01 17:19:12 [ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=2 -XX:+UseG1GC -jar /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 surefire6448660128033443499tmp surefire_4131168043975619749001tmp
Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
Nov 01 17:19:12 [ERROR] Process Exit Code: 239
Nov 01 17:19:12 [ERROR] Crashed tests:
Nov 01 17:19:12 [ERROR] org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
Nov 01 17:19:12 [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=2 -XX:+UseG1GC -jar /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 surefire6448660128033443499tmp surefire_4131168043975619749001tmp
Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
Nov 01 17:19:12 [ERROR] Process Exit Code: 239
Nov 01 17:19:12 [ERROR] Crashed tests:
Nov 01 17:19:12 [ERROR] org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
Nov 01 17:19:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
Nov 01 17:19:12 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457)
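A note on "Process Exit Code: 239" in the log above: on POSIX systems the parent process only sees the low 8 bits of the status passed to System.exit, so a negative status shows up as 256 plus that status. The sketch below illustrates the arithmetic. The value -17 is an assumption about the status used by the fatal-exit handler; it is not stated in this report, and the class and method names are hypothetical.

```java
public class ExitStatusSketch {

    /** Returns the status a POSIX parent process observes: only the low 8 bits survive. */
    static int observedExitStatus(int statusPassedToExit) {
        return statusPassedToExit & 0xFF;
    }

    public static void main(String[] args) {
        // Assumption: the handler that stopped the JVM called System.exit(-17).
        System.out.println(observedExitStatus(-17)); // prints 239
    }
}
```

If that assumption holds, an exit code of 239 points at a deliberate System.exit call rather than a native JVM crash.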
It looks like testSnapshotAsyncFailureFailsCheckpoint caused it: the test itself finished successfully, but a fatal error occurred while shutting down the cluster:
17:07:27,264 [ Checkpoint Timer] ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'Checkpoint Timer' produced an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:626) ~[classes/:?]
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) [?:1.8.0_292]
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) [?:1.8.0_292]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:814) [?:1.8.0_292]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_292]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_292]
    ... 8 more
Caused by: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-core-1.14-stream-SNAPSHOT.jar:1.14-stream-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.incrementNumberOfCheckpointsToClean(CheckpointsCleaner.java:105) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanup(CheckpointsCleaner.java:87) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanCheckpoint(CheckpointsCleaner.java:62) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.PendingCheckpoint.dispose(PendingCheckpoint.java:573) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:551) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1939) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1926) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:910) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:875) ~[classes/:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$6(CheckpointCoordinator.java:614) ~[classes/:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_292]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_292]
    ... 8 more
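The trace above suggests an ordering problem: the cleaner is closed during shutdown, and afterwards the 'Checkpoint Timer' thread aborts a pending checkpoint, which asks the already-closed cleaner to do work. The following is a minimal sketch of that ordering only, not Flink's actual code. SketchCleaner, closeThenAbort and the handler that merely records the error are hypothetical stand-ins; only the thread name and the exception message are taken from the log.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CleanerShutdownRaceSketch {

    /** Hypothetical stand-in for a cleaner that rejects new work once it is closed. */
    static final class SketchCleaner {
        private boolean closed; // guarded by "this"

        synchronized void cleanCheckpoint(Runnable discardAction) {
            if (closed) {
                throw new IllegalStateException("CheckpointsCleaner has already been closed");
            }
            discardAction.run();
        }

        synchronized void close() {
            closed = true;
        }
    }

    /**
     * Closes the cleaner first, then lets a timer-like thread request a cleanup.
     * Returns whatever reached that thread's uncaught-exception handler.
     */
    static Throwable closeThenAbort() {
        SketchCleaner cleaner = new SketchCleaner();
        AtomicReference<Throwable> uncaught = new AtomicReference<>();

        // Shutdown path: the cleaner is closed before pending work has been discarded.
        cleaner.close();

        // Timer path: aborting a pending checkpoint still asks the cleaner to clean up.
        Thread timer = new Thread(() -> cleaner.cleanCheckpoint(() -> { }), "Checkpoint Timer");
        // The handler here only records the error; a fatal-exit handler would stop the JVM instead.
        timer.setUncaughtExceptionHandler((thread, error) -> uncaught.set(error));
        timer.start();
        try {
            timer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return uncaught.get();
    }

    public static void main(String[] args) {
        System.out.println("Uncaught on timer thread: " + closeThenAbort());
    }
}
```

In the sketch the exception is only recorded. With a handler that terminates the process, the same ordering would take down the forked test JVM even though the test method had already passed.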
Attachments
Issue Links
- is duplicated by
  - FLINK-24792 OperatorCoordinatorSchedulerTest crashed JVM on AZP (Closed)
  - FLINK-24938 Checkpoint cleaner is closed before checkpoints are discarded (Closed)
- links to