[FLINK-5960] Make CheckpointCoordinator less blocking - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.2.0, 1.3.0, 1.9.0, 1.10.0
Fix Version/s: None
Component/s: Runtime / State Backends
Labels:
- stale-major

Description

Currently, the CheckpointCoordinator performs its operations under a global lock. This includes writing checkpoint data out to state storage. If that write blocks, the whole checkpoint coordinator stands still. I think we should rework the CheckpointCoordinator to make fewer assumptions about external systems, so that it tolerates write failures and timeouts. Furthermore, we should try to limit the scope of locking and the execution of potentially blocking operations under the lock. This will improve the runtime behaviour of the CheckpointCoordinator.
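
To illustrate the proposal, here is a minimal sketch of one way to narrow the lock scope. It is not Flink's actual CheckpointCoordinator code: the class `CheckpointFinalizerSketch`, its `Status` enum, and the `blockingWrite` callback are hypothetical names chosen for this example. The lock guards only short in-memory bookkeeping, while the potentially blocking write runs on a separate executor with a timeout (`CompletableFuture.orTimeout`, which requires Java 9 or later).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

// Hypothetical sketch, not Flink code: the lock protects only the in-memory
// status map, and the state-storage write never runs while the lock is held.
class CheckpointFinalizerSketch {
    public enum Status { PENDING, COMPLETED, FAILED }

    private final Object lock = new Object();
    private final Map<Long, Status> statusById = new HashMap<>();
    private final Executor ioExecutor;        // runs the blocking write
    private final LongConsumer blockingWrite; // stand-in for the state-storage write
    private final long timeoutMillis;

    CheckpointFinalizerSketch(Executor ioExecutor, LongConsumer blockingWrite, long timeoutMillis) {
        this.ioExecutor = ioExecutor;
        this.blockingWrite = blockingWrite;
        this.timeoutMillis = timeoutMillis;
    }

    public CompletableFuture<Status> finalizeCheckpoint(long checkpointId) {
        // Short critical section: bookkeeping only.
        synchronized (lock) {
            statusById.put(checkpointId, Status.PENDING);
        }
        // The write happens outside the lock; a failure or timeout is
        // recorded as FAILED instead of stalling other operations.
        return CompletableFuture
                .runAsync(() -> blockingWrite.accept(checkpointId), ioExecutor)
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .handle((ignored, error) -> {
                    Status result = (error == null) ? Status.COMPLETED : Status.FAILED;
                    synchronized (lock) {
                        statusById.put(checkpointId, result);
                    }
                    return result;
                });
    }

    public Status statusOf(long checkpointId) {
        synchronized (lock) {
            return statusById.get(checkpointId);
        }
    }
}
```

Note that in this sketch a timed-out write is not cancelled; it keeps running on the I/O executor. A real implementation would also have to decide how to discard or clean up the result of a write that finishes after its checkpoint was already marked as failed.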

Issue Links

duplicates

FLINK-13698 Rework threading model of CheckpointCoordinator

Reopened

is related to

FLINK-13698 Rework threading model of CheckpointCoordinator

Reopened

relates to

FLINK-13497 Checkpoints can complete after CheckpointFailureManager fails job

Closed

People

Assignee: Unassigned

Reporter: Till Rohrmann

Votes: 0

Watchers: 7

Dates

Created: 03/Mar/17 16:08

Updated: 23/Apr/21 08:39

Resolved: 23/Apr/21 08:39