[FLINK-7873] Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.3.2
Fix Version/s: None
Component/s: Runtime / State Backends
Labels: None

Description

Why I am proposing this:
The current recovery strategy always reads checkpoint data from the remote file system (HDFS). This consumes a lot of bandwidth when the state is large (e.g. 1 TB). Worse, if a job recovers again and again, it can use up all the network bandwidth and badly hurt the cluster. I therefore propose caching the checkpoint data locally and reading it from the local cache whenever possible, reading from the remote only if the local read fails. The advantage is that if an execution is assigned to the same TaskManager as before, recovery saves a lot of bandwidth and completes faster.
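The local-first read with a remote fallback can be sketched roughly as follows. This is only an illustration of the idea, not code from the pull request: `LocalFirstReader`, `cacheLocally`, and the `remoteReader` function (a stand-in for the HDFS read) are hypothetical names, and the local cache is an in-memory map for brevity, whereas a real cache would more likely live on the TaskManager's local disk.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: serve checkpoint data locally if cached, otherwise read it remotely. */
final class LocalFirstReader {
    // Assumption: handle id -> state bytes; stands in for a local on-disk cache.
    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
    // Assumption: stand-in for the remote (e.g. HDFS) read path.
    private final Function<String, byte[]> remoteReader;
    private int remoteReads = 0;

    LocalFirstReader(Function<String, byte[]> remoteReader) {
        this.remoteReader = remoteReader;
    }

    /** Keeps a local copy of the data written for a checkpoint. */
    void cacheLocally(String handleId, byte[] data) {
        localCache.put(handleId, data);
    }

    /** Tries the local cache first and falls back to the remote only on a miss. */
    byte[] read(String handleId) {
        byte[] local = localCache.get(handleId);
        if (local != null) {
            return local;
        }
        remoteReads++;
        return remoteReader.apply(handleId);
    }

    /** Number of reads that had to go to the remote store. */
    int remoteReads() {
        return remoteReads;
    }
}
```

If the execution lands on a TaskManager that still holds the entry, `read` never touches the network; otherwise the behaviour is the same as today's remote-only recovery.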

Solution:
The TaskManager does the caching and manages the cached data itself. It uses a simple TTL-like method to decide when to dispose of a cache entry: an entry is disposed of if it has not been touched for a time X, and touching an entry resets its TTL. This way all of the work is done by the TaskManager and is transparent to the JobManager. The only problem is that we may dispose of an entry that would still have been useful; in that case we have to fall back to reading the remote data. Users can avoid this by setting a proper TTL value according to the checkpoint interval and other factors.
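A minimal sketch of the TTL-like disposal described above, under stated assumptions: `TtlCheckpointCache`, its methods, and the injectable `clock` are hypothetical names chosen for illustration, and the entries hold bytes in memory rather than files on local disk. It is not the implementation from the linked pull request.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: cache entries expire unless touched within ttlMillis. */
final class TtlCheckpointCache {
    private static final class Entry {
        final byte[] data;
        volatile long lastTouchedMillis;

        Entry(byte[] data, long now) {
            this.data = data;
            this.lastTouchedMillis = now;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so the sketch can be tested

    TtlCheckpointCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(String key, byte[] data) {
        entries.put(key, new Entry(data, clock.getAsLong()));
    }

    /** Returns the cached data and resets its TTL, or null if absent or expired. */
    byte[] get(String key) {
        long now = clock.getAsLong();
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (now - e.lastTouchedMillis > ttlMillis) {
            entries.remove(key, e); // expired: caller must read from the remote
            return null;
        }
        e.lastTouchedMillis = now; // touching an entry resets its TTL
        return e.data;
    }

    /** Intended to run periodically; disposes of untouched entries and returns the count. */
    int evictExpired() {
        long now = clock.getAsLong();
        int removed = 0;
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            if (now - it.next().lastTouchedMillis > ttlMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return entries.size();
    }
}
```

A `null` from `get` corresponds to the case where a still-useful entry has already been disposed of, so the caller falls back to the remote read. Choosing `ttlMillis` comfortably larger than the checkpoint interval would make that case less likely.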

Can someone give me some advice? I would appreciate it very much~

Attachments

Issue Links

is superseded by

FLINK-8360 Implement task-local state recovery (Closed)

links to

GitHub Pull Request #5074

Local recovery

Activity

People

Assignee: Sihua Zhou

Reporter: Sihua Zhou

Votes: 0

Watchers: 6

Dates

Created: 19/Oct/17 11:43

Updated: 28/Feb/18 11:16

Resolved: 28/Feb/18 11:16