Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-35499

EventTimeWindowCheckpointingITCase times out due to Checkpoint expired before completing

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 1.20.0
    • None
    • None

    Description

      Going into the logs, we see the following error occurs:

      ================================================================================
      Test testTumblingTimeWindow[statebackend type =ROCKSDB_INCREMENTAL, buffersPerChannel = 2](org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase) is running.
      --------------------------------------------------------------------------------
      <...>
      20:24:23,562 [    Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 22 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863562 for job 15d0a663cb415b09b9a68ccc40640c6d.
      20:24:23,609 [jobmanager-io-thread-2] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 22 for job 15d0a663cb415b09b9a68ccc40640c6d (2349132 bytes, checkpointDuration=43 ms, finalizationTime=4 ms).
      20:24:23,610 [    Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 23 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1716927863610 for job 15d0a663cb415b09b9a68ccc40640c6d.
      20:24:23,620 [jobmanager-io-thread-2] WARN  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while processing AcknowledgeCheckpoint message
      java.lang.IllegalStateException: Attempt to reference unknown state: a9a90973-4ee5-384f-acef-58a7c7560920
      	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.state.SharedStateRegistryImpl.registerReference(SharedStateRegistryImpl.java:97) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.state.SharedStateRegistry.registerReference(SharedStateRegistry.java:53) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle.registerSharedStates(IncrementalRemoteKeyedStateHandle.java:289) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedState(OperatorSubtaskState.java:243) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.OperatorSubtaskState.registerSharedStates(OperatorSubtaskState.java:226) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.TaskStateSnapshot.registerSharedStates(TaskStateSnapshot.java:193) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1245) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$2(ExecutionGraphHandler.java:109) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$4(ExecutionGraphHandler.java:139) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_392]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_392]
      	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
      20:24:23,663 [Source: Custom Source (1/1)#1] INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Custom Source (1/1)#1 (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) switched from RUNNING to FINISHED.
      20:24:23,663 [Source: Custom Source (1/1)#1] INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task resources for Source: Custom Source (1/1)#1 (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1).
      20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source (1/1)#1 bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1.
      20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: Custom Source (1/1) (bc4de0d149fba0ca825771ff7eeae08d_bc764cd8ddf7a0cff126f51c16239658_0_1) switched from RUNNING to FINISHED.
      20:24:23,663 [flink-pekko.actor.default-dispatcher-8] INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Triggering a manual checkpoint for job 15d0a663cb415b09b9a68ccc40640c6d.
      20:24:23,664 [TumblingEventTimeWindows (3/4)#1] INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed RocksDB State Backend. Cleaning up RocksDB working directory /tmp/junit4600599408056781950/junit4118530363439384731/job_15d0a663cb415b09b9a68ccc40640c6d_op_WindowOperator_0a448493b4782967b150582570326227__3_4__uuid_f9b935a1-38cc-4cb3-92f0-97bf257d499d.
      20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  org.apache.flink.runtime.taskmanager.Task                    [] - TumblingEventTimeWindows (3/4)#1 (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) switched from RUNNING to FINISHED.
      20:24:23,665 [TumblingEventTimeWindows (3/4)#1] INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Freeing task resources for TumblingEventTimeWindows (3/4)#1 (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1).
      20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Un-registering task and sending final execution state FINISHED to JobManager for task TumblingEventTimeWindows (3/4)#1 bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1.
      20:24:23,665 [flink-pekko.actor.default-dispatcher-8] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - TumblingEventTimeWindows (3/4) (bc4de0d149fba0ca825771ff7eeae08d_0a448493b4782967b150582570326227_2_1) switched from RUNNING to FINISHED.
      20:24:23,665 [flink-pekko.actor.default-dispatcher-10] INFO  org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Received resource requirements from job 15d0a663cb415b09b9a68ccc40640c6d: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=3}]
      20:26:01,189 [flink-pekko.actor.default-dispatcher-16] INFO  org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [27b0503b2930b17041ffd3635dfb819f].
      20:26:01,190 [flink-pekko.actor.default-dispatcher-16] INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:3, state:ACTIVE, resource profile: ResourceProfile{taskHeapMemory=512.000gb (549755813888 bytes), taskOffHeapMemory=512.000gb (549755813888 bytes), managedMemory=64.000mb (67108864 bytes), networkMemory=32.000mb (33554432 bytes)}, allocationId: 27b0503b2930b17041ffd3635dfb819f, jobId: 15d0a663cb415b09b9a68ccc40640c6d).
      20:26:01,205 [flink-pekko.actor.default-dispatcher-16] INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer [] - Freeing slot 27b0503b2930b17041ffd3635dfb819f.
      20:34:23,610 [    Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 23 of job 15d0a663cb415b09b9a68ccc40640c6d expired before completing.
      20:34:23,611 [    Checkpoint Timer] WARN  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 23 for job 15d0a663cb415b09b9a68ccc40640c6d. (0 consecutive failed attempts so far)
      org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing.
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_392]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_392]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_392]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_392]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_392]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_392]
      	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_392]
      20:34:23,611 [    Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 599948
      20:34:23,612 [flink-pekko.actor.default-dispatcher-130] INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to recover from a global failure.
      org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed.
      	at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2281) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2260) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$1200(CheckpointCoordinator.java:102) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2346) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:64) ~[flink-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_392]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_392]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_392]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_392]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392]
      	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
      20:34:23,613 [flink-pekko.actor.default-dispatcher-130] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Tumbling Window Test (15d0a663cb415b09b9a68ccc40640c6d) switched from state RUNNING to FAILING.
      org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, backoffTimeMS=0)
      	at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:140) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:324) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:57) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.failGlobal(DefaultExecutionGraph.java:1092) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph$1.lambda$failJob$0(DefaultExecutionGraph.java:477) ~[flink-runtime-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:460) ~[flink-rpc-akka44fe192f-1cb4-4526-9c09-a5d6b8d70d87.jar:1.20-SNAPSHOT]
      	at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-core-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      <...> 
      

      Notably: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed.

      Attachments

        Activity

          People

            Unassigned Unassigned
            rskraba Ryan Skraba
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: