[FLINK-20912] Increase Log and Metric: Time consumed by Checkpoint Restore - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Abandoned
Affects Version/s: 1.12.1, 1.13.0
Fix Version/s: None
Component/s: Runtime / Checkpointing, Runtime / State Backends
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

In production, jobs with strict SLAs must restart quickly after a failover. Restoring state from a checkpoint is a significant part of task startup, so Flink should log the time consumed by checkpoint restore and expose it as a metric. That would make slow task starts much easier to troubleshoot.

For example, at Flink Forward Asia (FFA) 2020, ByteDance shared that they had parallelized OperatorState restore. Without these metrics, there are two problems:
1. The problem is hard to locate. When a task starts slowly, there is no way to tell whether slow checkpoint restore is the root cause.
2. An optimization cannot be evaluated. How much faster restore has become needs to be quantified.

I believe many companies have already added such metrics to their internal Flink versions.
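
To illustrate the kind of instrumentation being requested, here is a minimal sketch in plain Java. It is not Flink code and does not reflect how Flink implements this: `RestoreTimer`, `timeRestore`, and `getLastRestoreMillis` are hypothetical names, and the restore step is passed in as an arbitrary `Supplier`. The idea is only that the restore call is wrapped, its duration is logged, and the last value is kept so that a metric gauge could report it.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Logger;

/**
 * Hypothetical helper (not part of Flink): times a restore step, logs the
 * duration, and keeps the last value so a metric gauge could report it.
 */
final class RestoreTimer {
    private static final Logger LOG = Logger.getLogger(RestoreTimer.class.getName());

    // Last measured restore duration in milliseconds; -1 means "not measured yet".
    private final AtomicLong lastRestoreMillis = new AtomicLong(-1L);

    /**
     * Runs the given restore step and returns its result. The elapsed time is
     * recorded and logged even if the step throws, because the bookkeeping
     * happens in the finally block.
     */
    <T> T timeRestore(String taskName, Supplier<T> restoreStep) {
        long startNanos = System.nanoTime();
        try {
            return restoreStep.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
            lastRestoreMillis.set(elapsedMillis);
            LOG.info(() -> "Task " + taskName + " spent " + elapsedMillis + " ms in state restore");
        }
    }

    /** The value a gauge would expose; the metric name is left to the caller. */
    long getLastRestoreMillis() {
        return lastRestoreMillis.get();
    }
}
```

With a wrapper like this, a slow task start can be checked against the logged restore time (problem 1), and the gauge value before and after an optimization gives the speedup directly (problem 2).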

Issue Links

duplicates: FLINK-17012 Expose stage of task initialization (Closed)

People

Assignee: Unassigned
Reporter: Rui Fan
Votes: 1
Watchers: 4

Dates

Created: 10/Jan/21 09:05
Updated: 07/Mar/22 10:56
Resolved: 07/Mar/22 10:56