[FLINK-33695] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.19.0
Component/s: Runtime / Checkpointing, Runtime / Metrics
Labels:
None

Description

https://cwiki.apache.org/confluence/x/TguZE

Motivation
Currently Flink has a limited observability of checkpoint and recovery processes.

For checkpointing Flink has a very detailed overview in the Flink WebUI, which works great in many use cases, however it’s problematic if one is operating multiple Flink clusters, or if cluster/JM dies. Additionally there are a couple of metrics (like lastCheckpointDuration or lastCheckpointSize), however those metrics have a couple of issues:

They are reported and refreshed periodically, depending on the MetricReporter settings, which doesn’t take into account checkpointing frequency.
- If checkpointing interval > metric reporting interval, we would be reporting the same values multiple times.
- If checkpointing interval < metric reporting interval, we would be randomly dropping metrics for some of the checkpoints.

For recovery we are missing even the most basic of the metrics and Flink WebUI support. Also given the fact that recovery is even less frequent compared to checkpoints, adding recovery metrics would have even bigger problems with unnecessary reporting the same values.

In this FLIP I’m proposing to add support for reporting traces/spans (example: Traces) and use this mechanism to report checkpointing and recovery traces. I hope in the future traces will also prove useful in other areas of Flink like job submission, job state changes, ... . Moreover as the API to report traces will be added to the MetricGroup , users will be also able to access this API.

Attachments

Issue Links

is related to

FLINK-33696 FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter

Closed

relates to

FLINK-33697 FLIP-386: Support adding custom metrics in Recovery Spans

Closed

supercedes

FLINK-23411 Expose Flink checkpoint details metrics

Closed

FLINK-7894 Improve metrics around fine-grained recovery and associated checkpointing behaviors

Closed

Sub-Tasks

There are no Sub-Tasks for this issue.

Activity

People

Assignee:: Piotr Nowojski

Reporter:: Piotr Nowojski

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 29/Nov/23 16:54

Updated:: 05/Jan/24 16:25

Resolved:: 05/Jan/24 16:25