[FLINK-13698] Rework threading model of CheckpointCoordinator - ASF JIRA

Details

Type: Improvement
Status: Reopened
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:

Description

Currently, the CheckpointCoordinator and CheckpointFailureManager code is executed by several different threads (mostly the ioExecutor, but not exclusively). This causes multiple concurrency issues, for example: https://issues.apache.org/jira/browse/FLINK-13497

A proper fix would be to rethink the threading model there. At first glance, this code does not need to be multi-threaded, except for the parts doing the actual IO operations. It should therefore be possible to run everything in the ExecutionGraph's single thread and run only the necessary IO operations asynchronously, with a feedback loop that hands the results back to that thread ("mailbox style").
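To make the "mailbox style" idea concrete, here is a minimal sketch in plain Java. It is not Flink code: MailboxCoordinatorSketch, triggerCheckpoint, writeMetadata, and the executor names are all hypothetical, and the sleep merely stands in for real blocking IO. It shows coordinator state kept in unsynchronized collections that only one thread ever touches, with the blocking step pushed to an IO pool and its result fed back to the single thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; none of these names are Flink classes. */
public class MailboxCoordinatorSketch {

    // All coordinator state is read and mutated on this single thread only.
    private final ExecutorService mainThread = newDaemonPool("coordinator-main", 1);
    // Blocking work is handed to a separate pool and never touches the state.
    private final ExecutorService ioExecutor = newDaemonPool("coordinator-io", 4);

    // Plain, unsynchronized collections: safe because only mainThread uses them.
    private final Map<Long, String> pending = new HashMap<>();
    private final Map<Long, String> completed = new HashMap<>();
    private final Set<String> stateThreads = new TreeSet<>();

    private static ExecutorService newDaemonPool(String name, int size) {
        return Executors.newFixedThreadPool(size, r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    public CompletableFuture<Long> triggerCheckpoint(long id) {
        return CompletableFuture
            // Step 1 (main thread): register the pending checkpoint.
            .runAsync(() -> {
                stateThreads.add(Thread.currentThread().getName());
                pending.put(id, "TRIGGERED");
            }, mainThread)
            // Step 2 (IO pool): the blocking part.
            .thenApplyAsync(ignored -> writeMetadata(id), ioExecutor)
            // Step 3 (main thread again): the feedback loop applies the result.
            .thenApplyAsync(location -> {
                stateThreads.add(Thread.currentThread().getName());
                pending.remove(id);
                completed.put(id, location);
                return id;
            }, mainThread);
    }

    // Stand-in for real blocking IO such as writing checkpoint metadata.
    private static String writeMetadata(long id) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "chk-" + id + "/_metadata";
    }

    // Reads also go through the main thread, so they see a consistent snapshot.
    public Map<Long, String> completedCheckpoints() {
        return CompletableFuture
            .<Map<Long, String>>supplyAsync(() -> new TreeMap<>(completed), mainThread)
            .join();
    }

    // Names of the threads that have mutated coordinator state.
    public Set<String> stateThreadNames() {
        return CompletableFuture
            .<Set<String>>supplyAsync(() -> new TreeSet<>(stateThreads), mainThread)
            .join();
    }

    public static void main(String[] args) {
        MailboxCoordinatorSketch coordinator = new MailboxCoordinatorSketch();
        List<CompletableFuture<Long>> futures = new ArrayList<>();
        for (long id = 1; id <= 3; id++) {
            futures.add(coordinator.triggerCheckpoint(id));
        }
        futures.forEach(CompletableFuture::join);
        System.out.println("completed: " + coordinator.completedCheckpoints());
        System.out.println("state touched by: " + coordinator.stateThreadNames());
    }
}
```

Because every state mutation is submitted to the same single-threaded executor, no locks are needed, and checkpoints whose IO overlaps still have their results applied one at a time. Whether the real CheckpointCoordinator can be restructured this way depends on details this sketch ignores, such as failure handling and cancellation.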

I would strongly recommend fixing this issue before adding new features in the CheckpointCoordinator component.

Issue Links

causes

FLINK-13497 Checkpoints can complete after CheckpointFailureManager fails job

Closed

is duplicated by

FLINK-5960 Make CheckpointCoordinator less blocking

Closed

is related to

FLINK-15132 Checkpoint Coordinator does Checkpoint I/O in JobMaster Main Thread

Closed

relates to

FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

Resolved

FLINK-26306 [Changelog] Thundering herd problem with materialization

Resolved

FLINK-26590 Triggered checkpoints can be delayed by discarding shared state

Open

FLINK-16931 Large _metadata file lead to JobManager not responding when restart

Open

FLINK-5960 Make CheckpointCoordinator less blocking

Closed

links to

Refactor Thread Model of CheckpointCoordinator


Sub-Tasks


1. Avoid competition between different rounds of checkpoint triggering
   Status: Closed
   Assignee: Biao Liu
   Progress: 100%
   Original Estimate: Not Specified
   Time Spent: 20m

2. Separate checkpoint triggering into stages
   Status: Closed
   Assignee: Biao Liu
   Progress: 100%
   Original Estimate: Not Specified
   Time Spent: 20m

3. A preparation for snapshotting master hook state asynchronously
   Status: Closed
   Assignee: Biao Liu
   Progress: 100%
   Original Estimate: Not Specified
   Time Spent: 20m

4. Make all the non-IO operations in CheckpointCoordinator single-threaded
   Status: Reopened
   Assignee: Unassigned
   Progress: 100%
   Original Estimate: Not Specified
   Time Spent: 20m


People

Assignee: Unassigned
Reporter: Piotr Nowojski
Votes: 1
Watchers: 19

Dates

Created: 12/Aug/19 10:35
Updated: 17/Aug/23 13:27

Time Tracking

Estimated: Not Specified
Remaining: 0h
Logged: 1h 20m
