Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Do
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:
- auto-deprioritized-major
- pull-request-available

Description

Background

We run many jobs with large state in production. Based on our experience operating these jobs and our analysis of several specific problems, we believe that RocksDBStateBackend's incremental checkpoint has many advantages over savepoint:

Savepoint takes much longer than incremental checkpoint in jobs with large state. The figure below shows a job in our production environment: a savepoint takes nearly 7 minutes to complete, while a checkpoint takes only a few seconds. (That a checkpoint taken after a savepoint takes longer is a problem described in ~~FLINK-23949~~.)
Savepoint causes excessive CPU usage. The figure below shows the CPU usage of the same job as in the figure above:
Savepoint may cause excessive native memory usage and eventually push the TaskManager process memory usage over the limit. (We did not investigate the cause further or try to reproduce the problem on other large-state jobs; we only increased the overhead memory. So this reason may not be conclusive.)

For the above reasons, we tend to use retained incremental checkpoints to completely replace savepoints for jobs with large state.

Problems

Problem 1: retained incremental checkpoints are difficult to clean up once they have been used for recovery
The cause is that a job recovered from a retained incremental checkpoint may, in its subsequent checkpoints, reference files in that retained checkpoint's shared directory, even though the two do not belong to the same job instance. In the worst case, retained checkpoints are referenced one after another, forming a very long reference chain. This makes it difficult for users to manage retained checkpoints. In fact, we have also suffered failures caused by incorrect deletion of retained checkpoints.
Although we can use the file handles in the checkpoint metadata to figure out which files can be deleted, I think it is inappropriate to let users do this.
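
To illustrate the bookkeeping that users would otherwise have to do by hand, here is a minimal sketch. It assumes the file references of every checkpoint that must stay restorable have already been extracted from the metadata into plain sets of paths; the function name `deletable_files` and the directory layout are hypothetical, and the sketch does not parse Flink's real _metadata format.

```python
from typing import Dict, Set


def deletable_files(all_files: Set[str],
                    live_checkpoints: Dict[str, Set[str]]) -> Set[str]:
    """Return the files that no live checkpoint references.

    all_files: every file under the retained checkpoint's job directory.
    live_checkpoints: checkpoint name -> paths it references (assumed to
    have been extracted from the checkpoint metadata beforehand).
    """
    referenced: Set[str] = set()
    for files in live_checkpoints.values():
        referenced |= files
    return all_files - referenced


# Hypothetical layout: job B was restored from job A's retained chk-5
# and still references one of job A's shared files.
job_a_files = {
    "/ckpt/jobA/shared/sst-1",
    "/ckpt/jobA/shared/sst-2",
    "/ckpt/jobA/chk-5/_metadata",
}
live = {
    "jobB/chk-9": {
        "/ckpt/jobA/shared/sst-1",   # inherited from job A
        "/ckpt/jobB/shared/sst-7",
        "/ckpt/jobB/chk-9/_metadata",
    },
}
safe_to_delete = deletable_files(job_a_files, live)
```

In this example `sst-1` must be kept even though it lives under job A's directory, which is exactly the cross-job dependency that makes manual cleanup error-prone.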

Problem 2: checkpoints are not relocatable
Even if we can figure out all files referenced by a checkpoint, moving these files invalidates the checkpoint, because the metadata file references absolute file paths.
Since savepoints are already self-contained and relocatable (~~FLINK-5763~~), why not use savepoints just for migrating jobs to another place? Besides the savepoint performance problems described in the background, a very important reason is that the migration requirement may arise from a failure of the original cluster. In that case, there is no opportunity to trigger a savepoint.

Proposal

A job's checkpoint directory (user-defined-checkpoint-dir/<jobId>) contains all of its state files (self-contained)
As far as I know, currently only the subsequent checkpoints of jobs restored from a retained checkpoint violate this constraint. One possible solution is to re-upload all shared files at the first incremental checkpoint after the job starts, but we need to discuss how to distinguish between a new job instance and a restart.
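
The re-upload idea can be sketched as follows. This is only an illustration under stated assumptions: `plan_reupload` is a hypothetical helper, references are modeled as plain path strings, and file names are assumed not to collide across jobs. It only computes which files would need to be copied into the job's own directory and does not address how a new job instance is told apart from a restart.

```python
import posixpath
from typing import Dict, List


def plan_reupload(job_dir: str, referenced: List[str]) -> Dict[str, str]:
    """Map every referenced file outside job_dir to a target path under
    job_dir/shared, so that the next checkpoint is self-contained.

    Assumes base file names are unique, so no collision check is done.
    """
    job_dir = job_dir.rstrip("/")
    plan: Dict[str, str] = {}
    for path in referenced:
        if path.startswith(job_dir + "/"):
            continue  # already inside this job's own directory
        target = posixpath.join(job_dir, "shared", posixpath.basename(path))
        plan[path] = target
    return plan


# Hypothetical example: job B was restored from job A's retained checkpoint.
refs = ["/ckpt/jobA/shared/sst-1", "/ckpt/jobB/shared/sst-7"]
plan = plan_reupload("/ckpt/jobB", refs)
```

Here only `sst-1` appears in the plan; once it has been copied and the reference rewritten, job A's directory is no longer needed by job B.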

Use relative file paths in checkpoint metadata (relocatable)
Change all file references in the checkpoint metadata to paths relative to the _metadata file, so that we can copy user-defined-checkpoint-dir/<jobId> to any other place.
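
A minimal sketch of why relative references survive relocation, using plain POSIX-style path strings. The helper names `to_relative` and `resolve` and the directory names are hypothetical, and the sketch does not reflect how Flink actually serializes state handles.

```python
import posixpath


def to_relative(metadata_path: str, file_path: str) -> str:
    """Express file_path relative to the directory holding _metadata."""
    return posixpath.relpath(file_path, start=posixpath.dirname(metadata_path))


def resolve(metadata_path: str, relative_ref: str) -> str:
    """Resolve a stored relative reference against the current location
    of the _metadata file."""
    base = posixpath.dirname(metadata_path)
    return posixpath.normpath(posixpath.join(base, relative_ref))


# Written at the original location.
old_meta = "/ckpt/jobB/chk-9/_metadata"
ref = to_relative(old_meta, "/ckpt/jobB/shared/sst-7")

# After copying /ckpt/jobB to /backup/jobB, the same stored reference
# resolves to the copied file without rewriting the metadata.
new_meta = "/backup/jobB/chk-9/_metadata"
restored = resolve(new_meta, ref)
```

The stored reference stays valid only if the whole job directory is moved as a unit, which is why this proposal depends on the self-contained property above.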

BTW, this issue is very similar to ~~FLINK-5763~~, which can be read as supplementary background.

Attachments

- image-2021-09-08-17-06-31-560.png, 08/Sep/21 09:06, 309 kB, Feifan Wang
- image-2021-09-08-17-10-28-240.png, 08/Sep/21 09:10, 209 kB, Feifan Wang
- image-2021-09-08-17-55-46-898.png, 08/Sep/21 09:55, 221 kB, Feifan Wang
- image-2021-09-08-18-01-03-176.png, 08/Sep/21 10:01, 469 kB, Feifan Wang
- image-2021-09-14-14-22-31-537.png, 14/Sep/21 06:22, 97 kB, Feifan Wang

Issue Links

is related to

FLINK-25276 FLIP-203: Support native and incremental savepoints (Closed)

FLINK-25154 FLIP-193: Snapshots ownership (Resolved)

FLINK-5763 Make savepoints self-contained and relocatable (Closed)

links to

GitHub Pull Request #17136

People

Assignee: Unassigned

Reporter: Feifan Wang

Votes: 0

Watchers: 9

Dates

Created: 03/Sep/21 10:52

Updated: 07/Apr/22 11:44

Resolved: 07/Apr/22 10:18