Flink / FLINK-21251

Last valid checkpoint metadata lost after job exits restart loop


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 1.7.2
    • Fix Version/s: None
    • Component/s: None

    Description

      We have a Flink job on a relatively old version, 1.7.1, that failed with no valid checkpoint left to restore from. The job was first affected by Kafka network instability and fell into a restart loop under a restart policy of 3 restarts within 5 minutes. After the restart attempts were exhausted, the job transitioned to the terminal state FAILED and exited. The problem is that the last valid checkpoint, chk-4585, which had been restored multiple times during the restarts, was corrupted (its _metadata file was missing) after the job exited.
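
      For context, a policy of "3 restarts in 5 minutes" corresponds to Flink's failure-rate restart strategy. A minimal sketch of configuring it through the Java API (the 10-second delay between attempts is an assumed value, not taken from this report):

          import java.util.concurrent.TimeUnit;

          import org.apache.flink.api.common.restartstrategy.RestartStrategies;
          import org.apache.flink.api.common.time.Time;
          import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

          public class RestartPolicyExample {
              public static void main(String[] args) throws Exception {
                  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                  // At most 3 restarts within a 5-minute measurement interval;
                  // the 10-second delay between attempts is an assumption for illustration.
                  env.setRestartStrategy(RestartStrategies.failureRateRestart(
                          3,
                          Time.of(5, TimeUnit.MINUTES),
                          Time.of(10, TimeUnit.SECONDS)));

                  // ... define sources/sinks and call env.execute() as usual ...
              }
          }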

      I checked the checkpoint directory on HDFS and found that chk-4585, which was completed at 12:16, was modified at 12:23 while the JobManager was shutting down, with many error logs saying that the deletion of pending checkpoints had somehow failed. So I suspect that the checkpoint metadata was unexpectedly deleted by the JobManager.
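
      For reference, the metadata of a completed checkpoint lives in a file named _metadata inside the chk-<id> directory. A minimal sketch of how such a directory can be inspected with the Hadoop FileSystem API (the checkpoint path shown is hypothetical):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class CheckpointDirInspector {
              public static void main(String[] args) throws Exception {
                  // Hypothetical path; the real one comes from state.checkpoints.dir plus the job ID.
                  Path chkDir = new Path("hdfs:///flink/checkpoints/<job-id>/chk-4585");
                  FileSystem fs = chkDir.getFileSystem(new Configuration());

                  // Is the checkpoint metadata still there?
                  System.out.println("_metadata present: " + fs.exists(new Path(chkDir, "_metadata")));

                  // Modification times show whether the directory was touched after the
                  // checkpoint completed (e.g. during JobManager shutdown).
                  for (FileStatus status : fs.listStatus(chkDir)) {
                      System.out.println(status.getPath().getName() + " modified at " + status.getModificationTime());
                  }
              }
          }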

      The JobManager logs are attached below.

      Attachments

        1. ch-4585 content.png
          416 kB
          Paul Lin
        2. checkpoint dir.png
          295 kB
          Paul Lin
        3. jm_logs
          2.09 MB
          Paul Lin


          People

            Assignee: Unassigned
            Reporter: Paul Lin
            Votes: 0
            Watchers: 2
