Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.2.0, 1.3.0, 1.9.0, 1.10.0
- None
Description
Currently, the CheckpointCoordinator performs its operations under a global lock. This includes writing checkpoint data out to state storage. If that write blocks, the whole coordinator stands still. I think we should rework the CheckpointCoordinator to make fewer assumptions about external systems, so that it tolerates write failures and timeouts. Furthermore, we should limit the scope of locking and avoid executing potentially blocking operations under the lock. This will improve the runtime behaviour of the CheckpointCoordinator.
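To make the proposal concrete, here is a minimal sketch of the pattern described above: copy the shared state under the lock, perform the potentially blocking write without holding the lock and with a timeout, then re-acquire the lock only to record the outcome. The class `NarrowLockCoordinator`, its methods, and the `stateStore` callback are all hypothetical names chosen for illustration. This is not Flink's actual CheckpointCoordinator code or API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Hypothetical sketch only; names and structure are assumptions, not Flink code. */
final class NarrowLockCoordinator {
    private final Object lock = new Object();
    private final List<String> pendingAcks = new ArrayList<>(); // guarded by lock
    private int completed;                                      // guarded by lock
    private int failed;                                         // guarded by lock

    private final Consumer<List<String>> stateStore; // stand-in for external storage; may block or throw
    private final long timeoutMillis;

    NarrowLockCoordinator(Consumer<List<String>> stateStore, long timeoutMillis) {
        this.stateStore = stateStore;
        this.timeoutMillis = timeoutMillis;
    }

    /** Cheap bookkeeping; safe to do under the lock. */
    void acknowledge(String taskId) {
        synchronized (lock) {
            pendingAcks.add(taskId);
        }
    }

    /** Returns true if the write finished within the timeout, false on failure or timeout. */
    boolean finalizeCheckpoint() {
        final List<String> snapshot;
        synchronized (lock) {
            // Step 1: copy the state under the lock. No I/O happens here.
            snapshot = new ArrayList<>(pendingAcks);
            pendingAcks.clear();
        }

        boolean ok;
        try {
            // Step 2: the potentially blocking write runs on another thread while
            // the lock is NOT held, so acknowledge() stays responsive meanwhile.
            CompletableFuture.runAsync(() -> stateStore.accept(snapshot))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            ok = true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ok = false;
        } catch (ExecutionException | TimeoutException e) {
            // Write failure or timeout: treated as a failed checkpoint.
            // On timeout the write may still be running in the background;
            // a real implementation would have to cancel or clean it up.
            ok = false;
        }

        synchronized (lock) {
            // Step 3: re-acquire the lock only to record the outcome.
            if (ok) {
                completed++;
            } else {
                failed++;
            }
        }
        return ok;
    }

    int completedCount() {
        synchronized (lock) {
            return completed;
        }
    }

    int failedCount() {
        synchronized (lock) {
            return failed;
        }
    }
}
```

With this shape, a slow or failing state store costs one checkpoint attempt rather than stalling every caller that needs the lock. The sketch deliberately ignores questions a real rework would have to answer, such as what happens to the acknowledgements of a failed attempt and how concurrent finalization attempts are ordered.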
Issue Links
- duplicates
  - FLINK-13698 Rework threading model of CheckpointCoordinator (Reopened)
- is related to
  - FLINK-13698 Rework threading model of CheckpointCoordinator (Reopened)
- relates to
  - FLINK-13497 Checkpoints can complete after CheckpointFailureManager fails job (Closed)