[FLINK-18641] "Failure to finalize checkpoint" error in MasterTriggerRestoreHook - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.11.0
Fix Version/s: 1.11.2, 1.12.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Description

https://github.com/pravega/flink-connectors is a Pravega connector for Flink. Its ReaderCheckpointHook[1] class implements Flink's `MasterTriggerRestoreHook` interface to trigger a Pravega checkpoint during each Flink checkpoint, so that data can be recovered after a failure. The checkpoint recovery tests pass on Flink 1.10, but on Flink 1.11 they time out with the error below. We suspect this is related to the changes to the checkpoint coordinator threading model in Flink 1.11.

Error stacktrace:

2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN  o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 3. Failure reason: Failure to finalize checkpoint.
         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033)
         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948)
         at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has not been fully acknowledged yet
         at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
         at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298)
         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021)
         ... 9 common frames omitted
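
The `Caused by` frame shows a state check failing inside `PendingCheckpoint.finalizeCheckpoint`. The following is a minimal sketch of how a check of that kind can fail. It assumes, without confirmation from this report, that finalization is attempted as soon as every task has acknowledged while the state from a master hook is still outstanding. The class `PendingCheckpointSketch`, its methods, and the names `source-1` and `pravega-hook` are hypothetical and only model the idea; this is not Flink's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of a pending checkpoint (not Flink's class). */
public class PendingCheckpointSketch {
    private final Set<String> unackedTasks = new HashSet<>();
    private final Set<String> unackedMasterHooks = new HashSet<>();

    public PendingCheckpointSketch(Set<String> tasks, Set<String> masterHooks) {
        unackedTasks.addAll(tasks);
        unackedMasterHooks.addAll(masterHooks);
    }

    public void acknowledgeTask(String task) {
        unackedTasks.remove(task);
    }

    public void acknowledgeMasterHook(String hook) {
        unackedMasterHooks.remove(hook);
    }

    /** True once every task has acknowledged, regardless of master hooks. */
    public boolean areTasksFullyAcknowledged() {
        return unackedTasks.isEmpty();
    }

    /** True only when tasks AND master hooks have all acknowledged. */
    public boolean isFullyAcknowledged() {
        return unackedTasks.isEmpty() && unackedMasterHooks.isEmpty();
    }

    /** Mirrors the shape of the failing state check in the stack trace. */
    public String finalizeCheckpoint() {
        if (!isFullyAcknowledged()) {
            throw new IllegalStateException(
                    "Pending checkpoint has not been fully acknowledged yet");
        }
        return "completed";
    }

    public static void main(String[] args) {
        PendingCheckpointSketch checkpoint = new PendingCheckpointSketch(
                new HashSet<>(Arrays.asList("source-1")),
                new HashSet<>(Arrays.asList("pravega-hook")));

        // Assumed ordering: the task acknowledges before the hook's state arrives.
        checkpoint.acknowledgeTask("source-1");

        // Gating finalization on task acknowledgements alone trips the check.
        if (checkpoint.areTasksFullyAcknowledged()) {
            try {
                checkpoint.finalizeCheckpoint();
            } catch (IllegalStateException e) {
                System.out.println("Failure to finalize: " + e.getMessage());
            }
        }

        // Once the hook's state has arrived, finalization succeeds.
        checkpoint.acknowledgeMasterHook("pravega-hook");
        System.out.println(checkpoint.finalizeCheckpoint());
    }
}
```

In this model, the failure goes away if the caller gates finalization on `isFullyAcknowledged()` rather than on task acknowledgements alone. Whether the actual fix takes that form is not stated in this report.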

More details are in this mailing list thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html
See also https://github.com/pravega/flink-connectors/issues/387

Issue Links

links to

GitHub Pull Request #13044

People

Assignee: Jiangjie Qin

Reporter: Brian Zhou

Votes: 0

Watchers: 8

Dates

Created: 20/Jul/20 07:10

Updated: 09/Sep/20 06:34

Resolved: 09/Sep/20 01:33