Details
-
New Feature
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
https://cwiki.apache.org/confluence/x/TguZE
Motivation
Currently Flink has a limited observability of checkpoint and recovery processes.
For checkpointing Flink has a very detailed overview in the Flink WebUI, which works great in many use cases, however it’s problematic if one is operating multiple Flink clusters, or if cluster/JM dies. Additionally there are a couple of metrics (like lastCheckpointDuration or lastCheckpointSize), however those metrics have a couple of issues:
- They are reported and refreshed periodically, depending on the MetricReporter settings, which doesn’t take into account checkpointing frequency.
- If checkpointing interval > metric reporting interval, we would be reporting the same values multiple times.
- If checkpointing interval < metric reporting interval, we would be randomly dropping metrics for some of the checkpoints.
For recovery we are missing even the most basic of the metrics and Flink WebUI support. Also given the fact that recovery is even less frequent compared to checkpoints, adding recovery metrics would have even bigger problems with unnecessary reporting the same values.
In this FLIP I’m proposing to add support for reporting traces/spans (example: Traces) and use this mechanism to report checkpointing and recovery traces. I hope in the future traces will also prove useful in other areas of Flink like job submission, job state changes, ... . Moreover as the API to report traces will be added to the MetricGroup , users will be also able to access this API.
Attachments
Issue Links
- is related to
-
FLINK-33696 FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter
- Closed
- relates to
-
FLINK-33697 FLIP-386: Support adding custom metrics in Recovery Spans
- Closed
- supercedes
-
FLINK-23411 Expose Flink checkpoint details metrics
- Closed
-
FLINK-7894 Improve metrics around fine-grained recovery and associated checkpointing behaviors
- Closed