FLINK-12373

Improve checkpointing metrics

Details

    • New Feature
    • Status: Reopened
    • Not a Priority
    • Resolution: Unresolved

    Description

      The CheckpointMetrics class currently exposes four core metrics for each operator: bytesBuffered, alignment time, sync duration, and async duration.

      I think it would be a great improvement to track the sync duration as separate components, because that breakdown is critical for improving the SLA of large jobs.

      I suggest we break up the sync duration into four subcomponents:

       1. prepareSnapshotPreBarrier
       2. Snapshot timers
       3. Snapshot operator states
       4. Sync keyed state checkpoint
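
The four subcomponents above could be tracked roughly as follows. This is a minimal sketch, not Flink code: the class `SyncPhaseTimings`, its `Phase` enum, and the injectable clock are hypothetical names introduced only for illustration, and where such timing would hook into the operator snapshot path is left open.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical holder for a per-phase breakdown of the sync duration. Not part of Flink. */
final class SyncPhaseTimings {

    /** The four sub-phases proposed in this issue (names are illustrative). */
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanos = new EnumMap<>(Phase.class);
    private final LongSupplier clock;

    SyncPhaseTimings() {
        this(System::nanoTime);
    }

    /** The clock is injectable so the breakdown can be tested deterministically. */
    SyncPhaseTimings(LongSupplier clock) {
        this.clock = clock;
    }

    /** Runs one phase and adds its elapsed time to that phase's total, even if it throws. */
    void time(Phase phase, Runnable body) {
        long start = clock.getAsLong();
        try {
            body.run();
        } finally {
            nanos.merge(phase, clock.getAsLong() - start, Long::sum);
        }
    }

    /** Elapsed nanoseconds recorded for one phase, or 0 if it never ran. */
    long nanosFor(Phase phase) {
        return nanos.getOrDefault(phase, 0L);
    }

    /** Sum over all phases; this is the single number reported as sync duration today. */
    long totalNanos() {
        long total = 0L;
        for (long value : nanos.values()) {
            total += value;
        }
        return total;
    }
}
```

With a holder like this, each step of the synchronous snapshot would be wrapped in `time(Phase.X, ...)`, and the existing sync duration would remain available as `totalNanos()`, so the finer breakdown would be additive rather than a replacement.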

      The operator state part could perhaps be broken up further into keyed and non-keyed parts; I am not sure.

      I think knowing these metrics is crucial for users who want to minimise the latency caused by checkpointing.

      Whether we want to show all of this information in the web UI is a separate discussion.

       


          People

            Assignee: Unassigned
            Reporter: Gyula Fora (gyfora)
            Votes: 0
            Watchers: 5
