Details
- Improvement
- Status: Closed
- Critical
- Resolution: Fixed
- None
Description
One of the most important metrics is missing from the checkpoint stats: "start delay" (aka "barrier lag"), meaning the time between when the checkpoint was triggered and when the barriers arrive at a task.
That time is critical to identify if a checkpoint takes too long because of backpressure or other contention.
You can implicitly calculate this by "end_to_end_time - sync_time - async_time", but it is much more obvious for users that something is up when this number is explicitly shown.
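As a rough illustration of the implicit calculation above, here is a minimal sketch in Python. The function name `start_delay_ms` and its parameters are hypothetical and not part of Flink's API; the inputs are assumed to be per-task durations in milliseconds, and clamping at zero is an added assumption to guard against timer skew.

```python
def start_delay_ms(end_to_end_ms: int, sync_ms: int, async_ms: int) -> int:
    """Estimate the start delay (barrier lag) of a task's checkpoint.

    Hypothetical helper: derives the delay from the three durations
    named in the description, all assumed to be in milliseconds.
    """
    # Whatever part of the end-to-end time was not spent in the
    # synchronous or asynchronous phase is attributed to waiting
    # for the barriers to arrive.
    delay = end_to_end_ms - sync_ms - async_ms
    # Assumption: clamp at zero so small measurement skew between
    # timers never yields a negative delay.
    return max(delay, 0)


if __name__ == "__main__":
    # A 5 s checkpoint with 200 ms sync and 800 ms async work
    # leaves 4000 ms unaccounted for, i.e. the start delay.
    print(start_delay_ms(5000, 200, 800))
```

A large value relative to the sync and async times would suggest that the barriers were held up before reaching the task, for example by backpressure.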
Attachments
Issue Links
- relates to
- FLINK-18656 Start Delay metric is always zero for unaligned checkpoints and source tasks
- Closed
- links to