Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-18641

"Failure to finalize checkpoint" error in MasterTriggerRestoreHook

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      https://github.com/pravega/flink-connectors is a Pravega connector for Flink. The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` interface to trigger the Pravega checkpoint during Flink checkpoints to make sure the data recovery. The checkpoint recovery tests are running fine in Flink 1.10, but it has below issues in Flink 1.11 causing the tests time out. Suspect it is related to the checkpoint coordinator thread model changes in Flink 1.11

      Error stacktrace:

      2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN  o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message
      org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 3. Failure reason: Failure to finalize checkpoint.
               at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033)
               at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948)
               at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802)
               at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
               at java.util.concurrent.FutureTask.run(FutureTask.java:266)
               at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
               at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
               at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
               at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
               at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has not been fully acknowledged yet
               at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
               at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298)
               at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021)
               ... 9 common frames omitted
      

      More detail in this mailing thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html
      Also in https://github.com/pravega/flink-connectors/issues/387

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            becket_qin Jiangjie Qin
            Brian Zhou Brian Zhou
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment