Flink / FLINK-33695

FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces


Details

    Description

      https://cwiki.apache.org/confluence/x/TguZE

      Motivation
      Currently, Flink offers limited observability into its checkpointing and recovery processes.

      For checkpointing, the Flink WebUI provides a very detailed overview that works well in many use cases. However, it is problematic when operating multiple Flink clusters, or when the cluster/JM dies. Additionally, there are a couple of metrics (such as lastCheckpointDuration or lastCheckpointSize), but those metrics have a couple of issues:

      • They are reported and refreshed periodically, on a schedule determined by the MetricReporter settings, which does not take the checkpointing frequency into account.
        • If checkpointing interval > metric reporting interval, the same values would be reported multiple times.
        • If checkpointing interval < metric reporting interval, the metrics for some checkpoints would be dropped, effectively at random.
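
      The two cases above can be illustrated with a small simulation. This is a minimal sketch, not Flink code: it assumes that checkpoint k completes at exactly k times the checkpointing interval and that a gauge such as lastCheckpointDuration always holds the value of the most recently completed checkpoint. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GaugeSamplingSketch {

    /**
     * Returns, for every report tick, the id of the checkpoint whose value a
     * periodic reporter would read from a "last checkpoint" gauge.
     * Assumption: checkpoint k (1-based) completes at k * checkpointIntervalMs;
     * id 0 means that no checkpoint has completed yet.
     */
    public static List<Long> sampledCheckpointIds(
            long checkpointIntervalMs, long reportIntervalMs, long durationMs) {
        List<Long> seen = new ArrayList<>();
        for (long t = reportIntervalMs; t <= durationMs; t += reportIntervalMs) {
            // Integer division yields the number of checkpoints completed by time t.
            seen.add(t / checkpointIntervalMs);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Checkpoint every 60 s, report every 10 s: the same value is reported repeatedly.
        System.out.println(sampledCheckpointIds(60_000, 10_000, 120_000));
        // prints [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2]

        // Checkpoint every 10 s, report every 60 s: checkpoints 1-5 and 7-11 are never seen.
        System.out.println(sampledCheckpointIds(10_000, 60_000, 120_000));
        // prints [6, 12]
    }
}
```

      In a real job the checkpoint completion times are not aligned with the report ticks, so which checkpoints get skipped in the second case is not predictable.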

      For recovery, even the most basic metrics and Flink WebUI support are missing. Moreover, because recovery happens even less frequently than checkpointing, recovery metrics would suffer even more from unnecessarily reporting the same values over and over.

      In this FLIP I’m proposing to add support for reporting traces/spans (example: Traces) and to use this mechanism to report checkpointing and recovery traces. I hope that in the future traces will also prove useful in other areas of Flink, such as job submission and job state changes. Moreover, as the API to report traces will be added to the MetricGroup, users will also be able to access this API.
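
      To show why event-driven spans avoid the sampling problem, here is a minimal, self-contained sketch in which exactly one span is emitted when each checkpoint completes. It does not use the interfaces proposed in the FLIP, which are defined on the linked wiki page. SimpleSpan, SpanSink, and the assumed 250 ms checkpoint duration are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanReportingSketch {

    /** Hypothetical span: one immutable record per completed operation. */
    public static final class SimpleSpan {
        public final String name;
        public final long checkpointId;
        public final long startTsMillis;
        public final long endTsMillis;

        SimpleSpan(String name, long checkpointId, long startTsMillis, long endTsMillis) {
            this.name = name;
            this.checkpointId = checkpointId;
            this.startTsMillis = startTsMillis;
            this.endTsMillis = endTsMillis;
        }

        @Override
        public String toString() {
            return name + "#" + checkpointId + "[" + startTsMillis + ".." + endTsMillis + "]";
        }
    }

    /** Hypothetical reporter callback, invoked once per finished span. */
    public interface SpanSink {
        void onSpan(SimpleSpan span);
    }

    /** Simulates a number of checkpoints and returns the spans a sink would receive. */
    public static List<SimpleSpan> run(int checkpoints, long checkpointIntervalMs) {
        List<SimpleSpan> collected = new ArrayList<>();
        SpanSink sink = collected::add;
        for (long id = 1; id <= checkpoints; id++) {
            long start = id * checkpointIntervalMs;
            long end = start + 250; // assumed checkpoint duration, for illustration only
            // The span is pushed at completion time, independent of any reporting interval.
            sink.onSpan(new SimpleSpan("Checkpoint", id, start, end));
        }
        return collected;
    }

    public static void main(String[] args) {
        // Five checkpoints produce five spans: no duplicates, none dropped.
        run(5, 10_000).forEach(System.out::println);
    }
}
```

      Because the span is pushed when the event happens, the number of reported records always matches the number of checkpoints, whatever the reporter's schedule.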


            People

              pnowojski Piotr Nowojski