[FLINK-19778] Failed job reinitiated with wrong checkpoint after a ZK reconnection - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 1.11.0
Fix Version/s: None
Component/s: Runtime / Checkpointing, Runtime / Coordination
Labels: None

Description

We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its JobManager lost leadership during a ZooKeeper (ZK) full GC. After the ZK connection recovered, the job was somehow reinitiated, but no checkpoints were found in ZK. As a result, an earlier savepoint was used to restore the job, which rewound it unexpectedly.

For details, please see the attached JobManager log (jm_log).

Attachments

jm_log (46 kB), 23/Oct/20 06:35, Paul Lin

Issue Links

duplicates

FLINK-19816 Flink restored from a wrong checkpoint (a very old one and not the last completed one)

Closed

Delete this link

People

Assignee: Unassigned

Reporter: Paul Lin

Votes: 0

Watchers: 5

Dates

Created: 23/Oct/20 06:37

Updated: 09/Nov/20 10:15

Resolved: 09/Nov/20 10:15
