[FLINK-23874] JM did not store latest checkpoint id into Zookeeper, silently - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.12.1
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels: None

Description

With a ZooKeeper HA setup, the JobManager (JM) did not write the latest successful checkpoint ID to ZooKeeper under the path /flink/{app_id}/checkpoints/. As a result, when the JM restarted, the job resumed from a very old position.

We had a job that was resumed from savepoint 258. After it had run for a few days, the latest successful checkpoint was around chk 686. When something triggered a JM restart, the JM restored state from savepoint 258 instead of chk 686.

We checked ZooKeeper, and the stored checkpoint was indeed still 258. This means the JM had not written a checkpoint ID to ZooKeeper for a few days, and it did so without logging any error message.

Below are the relevant logs around the restart:

2021-08-18 11:09:16,505 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 686 for job 00000000000000000000000000000000 (228296 bytes in 827 ms).

2021-08-18 11:10:13,066 INFO org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation [] - Finished restoring from state handle: IncrementalRemoteKeyedStateHandle{backendIdentifier=c11d290c-617b-4ea5-b7ed-4853272f32a3, keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48}, checkpointId=258, sharedState={}, privateState={OPTIONS-000016=ByteStreamStateHandle{handleName='s3}, MANIFEST-000006=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/3c8e1c2f-616d-4f18-8b07-4a818e3ca110', dataBytes=336}, CURRENT=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/1dc5f341-8a73-4e69-96fb-4b026653da6d', dataBytes=16}}, metaStateHandle=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/chk-258/0ef57eb3-0f38-45f5-8f3d-3e7b87f5fd15', dataBytes=1704}, registered=false} without rescaling.
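
To spot this kind of mismatch in other deployments, a small log scan may help. The following is only a sketch, not part of Flink: it assumes the JM and TM logs have been merged in time order, that each log event sits on a single line, and that the messages look like the two quoted above. The function name `find_stale_restores` and the abbreviated sample lines are hypothetical.

```python
import re

# Patterns are assumptions based on the two log messages quoted in this report.
COMPLETED_RE = re.compile(r"Completed checkpoint (\d+) for job")
RESTORED_RE = re.compile(
    r"Finished restoring from state handle:.*?checkpointId=(\d+)"
)


def find_stale_restores(lines):
    """Return (restored_id, latest_completed_id) pairs for every restore
    that used a checkpoint older than the latest one completed before it."""
    latest = None  # highest completed checkpoint id seen so far
    stale = []
    for line in lines:
        m = COMPLETED_RE.search(line)
        if m:
            cid = int(m.group(1))
            latest = cid if latest is None else max(latest, cid)
            continue
        m = RESTORED_RE.search(line)
        if m and latest is not None and int(m.group(1)) < latest:
            stale.append((int(m.group(1)), latest))
    return stale


# Sample lines abbreviated from the logs in this report ("..." marks elisions).
sample = [
    "2021-08-18 11:09:16,505 INFO ...CheckpointCoordinator [] - "
    "Completed checkpoint 686 for job 00000000000000000000000000000000 "
    "(228296 bytes in 827 ms).",
    "2021-08-18 11:10:13,066 INFO ...RocksDBIncrementalRestoreOperation [] - "
    "Finished restoring from state handle: IncrementalRemoteKeyedStateHandle"
    "{..., checkpointId=258, ...} without rescaling.",
]

print(find_stale_restores(sample))  # [(258, 686)]
```

A non-empty result only indicates that a restore used an older checkpoint than the last one logged as completed; confirming the cause still requires inspecting what is actually stored in ZooKeeper.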

Attachments

container_e04_1628083845581_0254_02_000001_jm.log (455 kB, uploaded 07/Sep/21 09:37 by Youjun Yuan)
container_e04_1628083845581_0254_01_000001_jm.log (715 kB, uploaded 07/Sep/21 09:37 by Youjun Yuan)
container_e04_1628083845581_0254_01_000050_tm.log (784 kB, uploaded 07/Sep/21 09:37 by Youjun Yuan)

Issue Links

duplicates: FLINK-20992 Checkpoint cleanup can kill JobMaster (Closed)

is related to: FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)

People

Assignee: Unassigned

Reporter: Youjun Yuan

Votes: 0

Watchers: 5

Dates

Created: 19/Aug/21 10:15

Updated: 01/Oct/21 10:31

Resolved: 08/Sep/21 07:33