Details
- Type: Improvement
- Status: Reopened
- Priority: Not a Priority
- Resolution: Unresolved
- Affects Version/s: 1.10.0
- Fix Version/s: None
Description
Currently, CheckpointCoordinator and CheckpointFailureManager code is executed by several different threads (mostly the ioExecutor, but not only). This causes multiple concurrency issues, for example: https://issues.apache.org/jira/browse/FLINK-13497
A proper fix would be to rethink the threading model there. At first glance, this code does not need to be multi-threaded, except for the parts doing the actual IO operations. It should therefore be possible to run everything in the ExecutionGraph's single thread and run only the necessary IO operations asynchronously, with some feedback loop ("mailbox style").
I would strongly recommend fixing this issue before adding new features to the CheckpointCoordinator component.
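To make the "mailbox style" idea above concrete, here is a minimal sketch in plain Java. It is not Flink's actual CheckpointCoordinator: the class `MailboxStyleCoordinatorSketch`, its methods, and the `writeMetadata` stand-in for IO are all hypothetical names chosen for illustration. The assumption is that all coordinator state is read and written only on one single-threaded executor, while blocking IO runs on a separate pool and hands its result back to that single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of a mailbox-style coordinator; not Flink code. */
public class MailboxStyleCoordinatorSketch {

    // The only thread allowed to touch coordinator state (the "mailbox").
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // Pool used exclusively for blocking IO; it never touches the state below.
    private final ExecutorService ioExecutor = Executors.newFixedThreadPool(4);

    // Plain, unsynchronized state: safe only because mainThread owns it.
    private final List<Long> completedCheckpoints = new ArrayList<>();
    private boolean jobFailed = false;

    /** Runs the IO part asynchronously, then feeds the result back to mainThread. */
    public CompletableFuture<Boolean> triggerCheckpoint(long checkpointId) {
        return CompletableFuture
                .supplyAsync(() -> writeMetadata(checkpointId), ioExecutor)
                .thenApplyAsync(handle -> {
                    // Feedback step: executes on mainThread, so the check and
                    // the update cannot interleave with failJob().
                    if (jobFailed) {
                        return false; // a checkpoint must not complete after failure
                    }
                    completedCheckpoints.add(checkpointId);
                    return true;
                }, mainThread);
    }

    /** State change enqueued on the same single thread. */
    public CompletableFuture<Void> failJob() {
        return CompletableFuture.runAsync(() -> jobFailed = true, mainThread);
    }

    /** Reads go through mainThread too and return a defensive copy. */
    public CompletableFuture<List<Long>> completedCheckpoints() {
        return CompletableFuture.supplyAsync(
                () -> new ArrayList<>(completedCheckpoints), mainThread);
    }

    // Stand-in for a slow IO operation such as writing checkpoint metadata.
    private static String writeMetadata(long checkpointId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "metadata-" + checkpointId;
    }

    public void shutdown() {
        ioExecutor.shutdown();
        mainThread.shutdown();
    }

    public static void main(String[] args) throws Exception {
        MailboxStyleCoordinatorSketch coordinator = new MailboxStyleCoordinatorSketch();
        try {
            System.out.println("checkpoint 1 completed: " + coordinator.triggerCheckpoint(1L).get());
            coordinator.failJob().get();
            System.out.println("checkpoint 2 completed: " + coordinator.triggerCheckpoint(2L).get());
            System.out.println("completed checkpoints: " + coordinator.completedCheckpoints().get());
        } finally {
            coordinator.shutdown();
        }
    }
}
```

Because every state access is a task on `mainThread`, no locks are needed, and a race of the kind reported in FLINK-13497 (a checkpoint completing after the job was failed) reduces to an ordinary ordering check inside one thread. Whether the real code can be restructured this way depends on details this sketch does not model.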
Issue Links
- causes
  - FLINK-13497 Checkpoints can complete after CheckpointFailureManager fails job (Closed)
- is duplicated by
  - FLINK-5960 Make CheckpointCoordinator less blocking (Closed)
- is related to
  - FLINK-15132 Checkpoint Coordinator does Checkpoint I/O in JobMaster Main Thread (Closed)
- relates to
  - FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster (Resolved)
  - FLINK-26306 [Changelog] Thundering herd problem with materialization (Resolved)
  - FLINK-26590 Triggered checkpoints can be delayed by discarding shared state (Open)
  - FLINK-16931 Large _metadata file lead to JobManager not responding when restart (Open)
  - FLINK-5960 Make CheckpointCoordinator less blocking (Closed)
Sub-Tasks
1. Avoid competition between different rounds of checkpoint triggering | Closed | Biao Liu
2. Separate checkpoint triggering into stages | Closed | Biao Liu
3. A preparation for snapshotting master hook state asynchronously | Closed | Biao Liu
4. Make all the non-IO operations in CheckpointCoordinator single-threaded | Reopened | Unassigned