FLINK-23874

JM silently failed to store the latest checkpoint ID in ZooKeeper


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.12.1
    • None
    • None

    Description

      With a ZooKeeper HA setup, the JobManager (JM) did not write the latest successful checkpoint ID to ZooKeeper under the path /flink/{app_id}/checkpoints/. As a result, when the JM restarted, the job resumed from a very old position.

       

      We had a job that was resumed from savepoint 258. After it had run for a few days, the latest successful checkpoint was about chk 686. When something triggered a JM restart, the JM restored state from savepoint 258 instead of chk 686.

      We checked ZooKeeper, and the stored checkpoint was indeed still 258. This means the JM had not stored any checkpoint ID in ZooKeeper for a few days, and it logged no error message about it.
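The ZooKeeper check described above can be sketched as a small helper. This is only an illustration: it assumes the child znodes under the checkpoints path are named with the zero-padded decimal checkpoint ID, and that the list of child names has already been fetched with a ZooKeeper client. The function name `latest_checkpoint_id` is hypothetical.

```python
def latest_checkpoint_id(znode_names):
    """Return the highest checkpoint ID among child znode names, or None.

    Assumption: each child znode name is the decimal checkpoint ID,
    possibly zero-padded. Names that are not purely numeric are ignored.
    """
    ids = [int(name) for name in znode_names if name.isdigit()]
    return max(ids, default=None)


def is_store_stale(znode_names, last_completed_id):
    """True if ZooKeeper holds no checkpoint as recent as the last completed one."""
    stored = latest_checkpoint_id(znode_names)
    return stored is None or stored < last_completed_id
```

With the values from this report, a store that only holds checkpoint 258 while the logs show checkpoint 686 as completed would be flagged as stale.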

       

      Below are the relevant logs from around the restart:

      2021-08-18 11:09:16,505 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 686 for job 00000000000000000000000000000000 (228296 bytes in 827 ms).

       

      2021-08-18 11:10:13,066 INFO org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation [] - Finished restoring from state handle: IncrementalRemoteKeyedStateHandle{backendIdentifier=c11d290c-617b-4ea5-b7ed-4853272f32a3, keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48}, checkpointId=258, sharedState={}, privateState={OPTIONS-000016=ByteStreamStateHandle{handleName='s3}, MANIFEST-000006=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/3c8e1c2f-616d-4f18-8b07-4a818e3ca110', dataBytes=336}, CURRENT=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/1dc5f341-8a73-4e69-96fb-4b026653da6d', dataBytes=16}}, metaStateHandle=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/chk-258/0ef57eb3-0f38-45f5-8f3d-3e7b87f5fd15', dataBytes=1704}, registered=false} without rescaling.
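The mismatch visible in these two log records (checkpoint 686 completed, state restored from checkpoint 258) could be detected automatically with a rough sketch like the one below. It assumes each log record sits on a single line and relies only on the message fragments quoted in this report; the function name `find_stale_restore` is hypothetical.

```python
import re

# Patterns taken from the two log messages quoted in this report.
COMPLETED_RE = re.compile(r"Completed checkpoint (\d+) for job")
RESTORED_RE = re.compile(r"checkpointId=(\d+)")
RESTORE_MARKER = "Finished restoring from state handle"


def find_stale_restore(log_lines):
    """Return (last_completed_id, restored_id) for the first restore that
    uses a checkpoint older than the last one reported as completed,
    or None if no such restore is found."""
    last_completed = None
    for line in log_lines:
        completed = COMPLETED_RE.search(line)
        if completed:
            last_completed = int(completed.group(1))
            continue
        if RESTORE_MARKER in line:
            restored = RESTORED_RE.search(line)
            if restored and last_completed is not None:
                restored_id = int(restored.group(1))
                if restored_id < last_completed:
                    return (last_completed, restored_id)
    return None
```

Run over JM and TaskManager logs merged in timestamp order, a non-None result indicates that a restore used an older checkpoint than the last one the coordinator reported as completed.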

       


            People

              Assignee: Unassigned
              Reporter: Youjun Yuan (ubyyj)
              Votes: 0
              Watchers: 5
