Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Duplicate
- Version: 1.15.0
Description
Acknowledgement of a checkpoint failed, then the checkpoint expired, then the checkpoint failure threshold was reached and the job failed.
Randomly selected true for execution.checkpointing.unaligned
Randomly selected PT2S for execution.checkpointing.alignment-timeout
Randomly selected true for state.backend.changelog.enabled
Randomly selected PT0.1S for state.backend.changelog.periodic-materialize.interval
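For readers reproducing the run, the randomized options can be written down as a plain key/value map. The following is only a sketch using the JDK, not Flink's own configuration API: the class name `RandomizedCheckpointOptions` and its helper methods are hypothetical, and the only facts taken from the report are the four option keys and their values. It also converts the two ISO-8601 duration strings to milliseconds, which makes the very short materialization interval easier to see.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; not part of Flink.
public class RandomizedCheckpointOptions {

    // Option keys and values copied verbatim from the test output above.
    static Map<String, String> options() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("execution.checkpointing.unaligned", "true");
        opts.put("execution.checkpointing.alignment-timeout", "PT2S");
        opts.put("state.backend.changelog.enabled", "true");
        opts.put("state.backend.changelog.periodic-materialize.interval", "PT0.1S");
        return opts;
    }

    // Parses an ISO-8601 duration string such as "PT2S" into milliseconds.
    static long millis(String isoDuration) {
        return Duration.parse(isoDuration).toMillis();
    }

    public static void main(String[] args) {
        options().forEach((key, value) -> {
            // Only the duration-valued options start with "PT".
            String shown = value.startsWith("PT")
                    ? value + " (" + millis(value) + " ms)"
                    : value;
            System.out.println(key + " = " + shown);
        });
    }
}
```

With these values the alignment timeout is 2000 ms and the periodic materialization interval is 100 ms.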
[ERROR] Tests run: 64, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 700.545 s <<< FAILURE! - in org.apache.flink.table.planner.runtime.stream.sql.SplitAggregateITCase
[ERROR] SplitAggregateITCase.testAggWithJoin Time elapsed: 601.77 s <<< ERROR!
org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
    at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:141)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:259)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:47)
    at akka.dispatch.OnComplete.internal(Future.scala:300)
    at akka.dispatch.OnComplete.internal(Future.scala:297)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
    at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
    ...
Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:160)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:123)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2046)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2025)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:98)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2104)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
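The "Exceeded checkpoint tolerable failure threshold" cause in the trace above comes from a failure counter being compared against a configured limit. The sketch below is a simplified model of that idea, not Flink's CheckpointFailureManager: the class `FailureThresholdSketch` and its methods are hypothetical, and it assumes that consecutive failures are counted, that a successful checkpoint resets the count, and that the job fails once the count exceeds the tolerable number.

```java
// Hypothetical, simplified model of a checkpoint failure threshold.
public class FailureThresholdSketch {

    private final int tolerableFailures;
    private int continuousFailures = 0;

    FailureThresholdSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // Records one failed (for example, expired) checkpoint.
    // Returns true when the job should be failed.
    boolean onCheckpointFailure() {
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    // Assumption: a completed checkpoint resets the consecutive-failure count.
    void onCheckpointSuccess() {
        continuousFailures = 0;
    }

    public static void main(String[] args) {
        // With a tolerable number of 0, the first expired checkpoint fails the job.
        FailureThresholdSketch strict = new FailureThresholdSketch(0);
        System.out.println("fail job: " + strict.onCheckpointFailure());
    }
}
```

Under this model, a single expired checkpoint is enough to fail the job when no failures are tolerated, which matches the sequence described in this report: the acknowledgement fails, the checkpoint never completes and expires, and the expiry is counted as a failure.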
12:18:11,760 [jobmanager-io-thread-5] WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
java.lang.IllegalStateException: Attempt to reference unknown state: 4a798990-1428-424c-813a-2ec1c4fcee8f-KeyGroupRange{startKeyGroup=0, endKeyGroup=31}-000019.sst
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-core-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:82) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:317) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.SharedStateRegistryImpl.registerAll(SharedStateRegistryImpl.java:172) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.state.changelog.ChangelogStateBackendHandle$ChangelogStateBackendHandleImpl.registerSharedStates(ChangelogStateBackendHandle.java:124) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:229) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:219) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:189) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1114) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-runtime-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Issue Links
- blocks FLINK-21352 FLIP-158: Generalized incremental checkpoints (Resolved)
- duplicates FLINK-26231 [Changelog] Incorrect MaterializationID passed to ChangelogStateBackendHandleImpl (Resolved)