Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.18.0
Description
When a task manager dies while the JM is generating an ExecutionGraph in the background then CreatingExecutionGraph#handleExecutionGraphCreation can transition back into WaitingForResources if the TM hosted one of the slots that we planned to use in tryToAssignSlots.
At this point the ExecutionGraph was already transitioned to running, which implicitly kicks of periodic checkpointing by the CheckpointCoordinator, without the operator coordinator holders being initialized yet (as this happens after we assigned slots).
This effectively leaks that CheckpointCoordinator, including the timer thread that will continue to try triggering checkpoints, which will naturally fail to trigger.
This can cause a JM crash because it results in OperatorCoordinatorHolder#abortCurrentTriggering to be called, which fails with an NPE since the mainThreadExecutor was not initialized yet.
java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707) at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ... 7 more Caused by: java.lang.NullPointerException at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388) at java.base/java.util.ArrayList.forEach(ArrayList.java:1541) at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ... 8 more
Attachments
Issue Links
- links to