Flink / FLINK-19816

Flink restored from a wrong checkpoint (a very old one and not the last completed one)


    Description

      Summary

      Upon failure, Flink apparently did not restore from the last completed checkpoint. Instead, it restored from a very old one. As a result, the restored Kafka offsets were no longer valid, and the job replayed from the beginning because the Kafka consumer's "auto.offset.reset" was set to "EARLIEST".
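
The replay described above can be illustrated with a small model of how a consumer chooses its start position for a single partition. This is a sketch, not Kafka's or Flink's implementation: the function name `resolve_start_offset` and its parameters are hypothetical, and it assumes the offsets in the old checkpoint had fallen outside the range the broker still retained.

```python
def resolve_start_offset(restored, log_start, log_end, reset_policy):
    """Toy model of start-position selection for one partition.

    restored     -- offset recovered from the checkpoint (hypothetical input)
    log_start    -- earliest offset the broker still retains
    log_end      -- end of the log (offset of the next record to be written)
    reset_policy -- value of auto.offset.reset: "earliest", "latest" or "none"
    """
    if log_start <= restored <= log_end:
        # The restored offset is still valid: resume exactly where the
        # checkpoint left off.
        return restored

    # The restored offset is out of range, so the reset policy decides.
    policy = reset_policy.lower()
    if policy == "earliest":
        return log_start  # replay everything that is still retained
    if policy == "latest":
        return log_end    # skip to the end of the log
    raise ValueError(f"offset {restored} is out of range and no reset policy is set")
```

With a valid restored offset the policy never comes into play; with an offset older than `log_start` and the policy set to "EARLIEST", the model returns `log_start`, which corresponds to the full replay reported here.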

      This is an embarrassingly parallel, stateless job with a parallelism of over 1,000. The full JobManager log at INFO level is available upon request.

      Sequence of events from the logs

      Just before the failure, checkpoint 210768 completed.

      2020-10-25 02:35:05,970 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
      

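      During restart, the ZooKeeper checkpoint store reported 0 checkpoints, and the job then started from a very old checkpoint, 203531, whose path contains a different job ID (47e2a25a8d0b696c7d0d423722bb6f54).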

      2020-10-25 02:36:03,301 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-3]  - Start SessionDispatcherLeaderProcess.
      2020-10-25 02:36:03,302 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5]  - Recover all persisted job graphs.
      2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem                         [cluster-io-thread-25]  - Deleting path: s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
      2020-10-25 02:36:03,307 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5]  - Trying to recover job with job id 233b4938179c06974e4535ac8a868675.
      
      2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem                         [cluster-io-thread-25]  - Deleting path: s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
      ...
      2020-10-25 02:36:03,427 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003]  - Recovering checkpoints from ZooKeeper.
      2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003]  - Found 0 checkpoints in ZooKeeper.
      2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003]  - Trying to fetch 0 checkpoints from storage.
      2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [flink-akka.actor.default-dispatcher-82003]  - Starting job 233b4938179c06974e4535ac8a868675 from savepoint s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata ()
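
The mismatch in the log excerpts above can be checked mechanically. The following is a minimal sketch that scans JobManager log lines for the last "Completed checkpoint" message and the "Starting job ... from savepoint" message, then reports both. The regular expressions are written against the message formats quoted in this report only; the function name `summarize_restore` and the returned field names are hypothetical, and other Flink versions may log differently.

```python
import re

# Patterns derived from the log lines quoted in this report (assumption:
# other Flink versions may word these messages differently).
COMPLETED_RE = re.compile(r"Completed checkpoint (\d+) for job ([0-9a-f]+)")
RESTORED_RE = re.compile(
    r"Starting job ([0-9a-f]+) from savepoint \S*?/([0-9a-f]+)/chk-(\d+)/_metadata"
)


def summarize_restore(lines):
    """Return (last_completed, restored) found in an iterable of log lines.

    last_completed -- (checkpoint_id, job_id) of the last completed checkpoint,
                      or None if no such line was seen
    restored       -- dict with the running job ID, the job ID embedded in the
                      restore path, and the restored checkpoint ID, or None
    """
    last_completed = None
    restored = None
    for line in lines:
        match = COMPLETED_RE.search(line)
        if match:
            last_completed = (int(match.group(1)), match.group(2))
            continue
        match = RESTORED_RE.search(line)
        if match:
            restored = {
                "job_id": match.group(1),
                "path_job_id": match.group(2),
                "checkpoint": int(match.group(3)),
            }
    return last_completed, restored
```

Run over the lines quoted here, this should pair checkpoint 210768 as the last completed one with 203531 as the restored one, and show that the job ID in the restore path differs from the ID of the running job.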
      

      Attachments

        1. jm.log
          85 kB
          Paul Lin

            People

              trohrmann Till Rohrmann
              stevenz3wu Steven Zhen Wu