  1. Flink
  2. FLINK-4410

Report more information about operator checkpoints

    Details

      Description

      Checkpoint statistics contain the duration of a checkpoint, measured from the moment the CheckpointCoordinator starts it to the moment the acknowledge message arrives.

      We should additionally expose

      • duration of the synchronous part of a checkpoint
      • duration of the asynchronous part of a checkpoint
      • number of bytes buffered during the stream alignment phase
      • duration of the stream alignment phase

      Note: With at-least-once semantics, the latter two will always be zero.
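
      As a rough illustration of the four values listed above, a minimal sketch of a per-operator value holder could look like the following. The class name `OperatorCheckpointStats`, its fields, and the `atLeastOnce` factory are hypothetical and only serve the example; they are not Flink's actual classes.

```java
// Hypothetical value holder for the four proposed statistics.
// This is an illustration only and does not mirror any Flink class.
final class OperatorCheckpointStats {
    final long syncDurationMillis;       // duration of the synchronous part
    final long asyncDurationMillis;      // duration of the asynchronous part
    final long alignmentBufferedBytes;   // bytes buffered during stream alignment
    final long alignmentDurationMillis;  // duration of the stream alignment phase

    OperatorCheckpointStats(
            long syncDurationMillis,
            long asyncDurationMillis,
            long alignmentBufferedBytes,
            long alignmentDurationMillis) {
        this.syncDurationMillis = syncDurationMillis;
        this.asyncDurationMillis = asyncDurationMillis;
        this.alignmentBufferedBytes = alignmentBufferedBytes;
        this.alignmentDurationMillis = alignmentDurationMillis;
    }

    // With at-least-once semantics both alignment values are always zero,
    // as stated in the note above.
    static OperatorCheckpointStats atLeastOnce(long syncDurationMillis, long asyncDurationMillis) {
        return new OperatorCheckpointStats(syncDurationMillis, asyncDurationMillis, 0L, 0L);
    }
}
```

      For example, `OperatorCheckpointStats.atLeastOnce(12L, 340L)` would describe an operator that spent 12 ms in the synchronous part and 340 ms in the asynchronous part, with both alignment values at zero.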

          Activity

          ivan.mushketyk Ivan Mushketyk added a comment -

          Hi Ufuk Celebi.

          Just to make sure that I understand correctly what you mean by the synchronous and asynchronous parts. Do I understand correctly that they are:

          • synchronous - the time span between the checkpoint being initiated and the moment when the TriggerCheckpoint messages are sent
          • asynchronous - the time between all TriggerCheckpoint messages being sent and all replies being received
          aljoscha Aljoscha Krettek added a comment -

          Hi,
          there are actually three different durations that could be reported:

          • time from the checkpoint coordinator initiating a checkpoint to an operator acknowledging that checkpoint
          • time that an operator spends in the synchronous part of the checkpoint
          • time that an operator spends in the asynchronous part of the checkpoint

          About synchronous/asynchronous: for this you can look at StreamTask.performCheckpoint(). At the end of the method a Thread is started that does the asynchronous work of the checkpoint, and the method returns. Thus, the time until then would be the synchronous part, and the time spent in that thread would be the asynchronous part.
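
          The split described above can be sketched roughly as follows. This is a minimal illustration, not the code of StreamTask.performCheckpoint(): the class `CheckpointTimingSketch`, its fields, and the two `Runnable` parameters are assumptions made for the example.

```java
// Hypothetical sketch of timing the synchronous and asynchronous parts
// separately. Names are illustrative and do not mirror StreamTask internals.
final class CheckpointTimingSketch {
    // -1 means "not measured yet"; volatile so other threads see the result.
    volatile long syncDurationNanos = -1;
    volatile long asyncDurationNanos = -1;

    // Runs the synchronous part on the calling thread, then starts a thread
    // for the asynchronous part and returns without waiting for it.
    Thread performCheckpoint(Runnable syncPart, Runnable asyncPart) {
        final long syncStart = System.nanoTime();
        syncPart.run();
        syncDurationNanos = System.nanoTime() - syncStart;

        Thread asyncThread = new Thread(() -> {
            final long asyncStart = System.nanoTime();
            asyncPart.run();
            asyncDurationNanos = System.nanoTime() - asyncStart;
        }, "async-checkpoint-sketch");
        asyncThread.start();
        return asyncThread;
    }
}
```

          Only the synchronous part runs on the calling thread, so it is the only part that holds up the caller; the asynchronous duration becomes available once the returned thread has finished.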

          StephanEwen Stephan Ewen added a comment -

          I will take up this issue. I have a pretty good plan for how to do this and for doing some overdue cleanup in the process.

          StephanEwen Stephan Ewen added a comment -

          Code that gathers additional metrics is integrated. What is missing is the code that collects the information on the side of the CheckpointCoordinator and visualizes it in the web frontend.

          rmetzger Robert Metzger added a comment -

          I'll add the CheckpointMetrics to the coordinator and the web frontend.

          githubbot ASF GitHub Bot added a comment -

          GitHub user uce opened a pull request:

          https://github.com/apache/flink/pull/3042

          FLINK-4410 Expose more fine grained checkpoint statistics

          This PR exposes more fine-grained checkpoint statistics. The previous version of the tracking code had a couple of shortcomings:

          • Only completed checkpoints were tracked in the history. You did not see in-progress or failed checkpoints.
          • Only the latest completed checkpoint had more fine-grained stats per operator and subtask. This meant that the statistics of a possibly interesting checkpoint could be replaced by a live update while you were looking at them.
          • Many newly tracked statistics, like checkpoint duration at the operator or alignment duration, were not exposed.

          This PR addresses these issues. For the extended tracking of the life cycle I decided to add tracking callbacks of all relevant entities like `PendingCheckpointStats`, `CompletedCheckpointStats`, `SubtaskStateStats`, `TaskStateStats`, etc. The life cycle of these objects follows that of their corresponding entities.

          Furthermore, this adds new REST API handlers that work with the new tracker and a new layout for displaying the statistics.

          Some screenshots:

          *Clicking on the Checkpoints Tab*: Sub tabs for overview, history, summary stats, and the config.

          ![00-start](https://cloud.githubusercontent.com/assets/1756620/21461971/3fdfb9be-c957-11e6-9f61-62610aa95da4.png)

          *Clicking on the History Tab*: Lists recent checkpoints, including in progress ones.

          ![01-history](https://cloud.githubusercontent.com/assets/1756620/21461994/657fd0a0-c957-11e6-8d08-0f084e018aca.png)

          *Clicking on details for a checkpoint*:

          ![02-details](https://cloud.githubusercontent.com/assets/1756620/21462027/ce4577a2-c957-11e6-9851-9d225c3762f4.png)

          *After triggering a savepoint*:

          ![03-savepoint](https://cloud.githubusercontent.com/assets/1756620/21462031/d6857318-c957-11e6-810b-e6d639b5caaf.png)

          *Details for the triggered savepoint*:

          ![04-savepoint_details](https://cloud.githubusercontent.com/assets/1756620/21462038/e80c1916-c957-11e6-984c-2447ec877c2d.png)

          *Failed checkpoint while cancelling job*:

          ![05-failed_checkpoint](https://cloud.githubusercontent.com/assets/1756620/21462049/f9ac90f6-c957-11e6-8e0d-48dba2581378.png)

          ![06-failed_checkpoint_details](https://cloud.githubusercontent.com/assets/1756620/21462052/fdd2e068-c957-11e6-9cb6-e4ece5c5dd36.png)

          ![07-failed_checkpoint_overview](https://cloud.githubusercontent.com/assets/1756620/21462062/05fd444a-c958-11e6-8fc5-580f4e9e4e18.png)

          *Clicking on the config tab*:

          ![09-config](https://cloud.githubusercontent.com/assets/1756620/21462067/0d3f6210-c958-11e6-9e1a-0767a8f557a5.png)

          *After restoring from the savepoint*:

          ![08-restore_from_savepoint](https://cloud.githubusercontent.com/assets/1756620/21462071/1559a97e-c958-11e6-8ce5-b4287408d918.png)

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/uce/flink 4410-checkpoint_stats

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3042.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3042


          commit 700ec439ed0e9fb00c52e6e373a5bcccfecce963
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-12-23T19:31:29Z

          FLINK-4410 [runtime, runtime-web] Remove old checkpoint stats tracker code

          commit c3f50c956f281a316a17b390851443c5be3adb6c
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-12-23T19:37:08Z

          FLINK-4410 [runtime] Rework checkpoint stats tracking

          commit 1db53a69829be8472fb74b6b83f0d3638121762f
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-12-23T19:44:12Z

          FLINK-4410 [runtime-web] Add detailed checkpoint stats handlers

          commit d6f6e7d48e05da47e02e8710fca699104bcc5988
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-12-23T19:44:59Z

          FLINK-4410 [runtime-web] Add new layout for checkpoint stats

          commit ab6c597f51c4aeea81dde0f82a3e1e7e72571ad9
          Author: Ufuk Celebi <uce@apache.org>
          Date: 2016-12-23T19:47:02Z

          FLINK-4410 [runtime-web] Rebuild JS/HTML files


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3042

          Wow, this looks awesome, great work.

          Small note: In screenshot 2, it says "End to Duration", which should probably be "End to End Duration".

          Is there also a column that shows the synchronous and asynchronous parts of the checkpointing time?

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3042

          I think what would put a cherry on top is if we can break the "End To End Duration" down into

          • Delay till triggering (how long until all barriers were there)
          • Synchronous checkpoint time
          • Asynchronous checkpoint time

          That would help big time, as many users currently get confused when checkpoints have long async times, assuming that the computation halts for that time.

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3042

          Hey Stephan! Thanks for spotting the typo. The numbers are already reported, but I forgot to attach the screenshot. They are only reported per operator/task, though, because I'm not sure what the best way to summarize them would be: simply displaying the sum or maximum of the checkpoint alignment duration or the sync/async checkpoint duration does not work well imo.

          ![subtask-details](https://cloud.githubusercontent.com/assets/1756620/21466781/c7af182a-c9d5-11e6-8022-9870b9c3e74a.png)

          You get these numbers for all tracked checkpoints.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3042

          Okay, super. The 'time to triggering' is then `end_to_end_time - sync_time - async_time`?
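
          The arithmetic proposed in this comment can be written out as a small sketch. The class and method names are hypothetical, and the formula is the one asked about above, not a confirmed implementation.

```java
// Hypothetical helper that applies the formula from the comment:
// time to triggering = end_to_end_time - sync_time - async_time.
final class TriggeringDelaySketch {
    static long triggeringDelayMillis(long endToEndMillis, long syncMillis, long asyncMillis) {
        return endToEndMillis - syncMillis - asyncMillis;
    }
}
```

          For instance, an end-to-end duration of 1000 ms with a 50 ms synchronous part and a 700 ms asynchronous part would leave 250 ms as the time to triggering.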

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3042

          Sorry for the long delay here. We can add that number as a separate column, Triggering Delay. What do you think? As a follow-up, I would also like to add documentation about what all the numbers mean.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3042

          I think a triggering delay would be very nice and helpful.
          We can do this as a next step, separate from this pull request.

          githubbot ASF GitHub Bot added a comment -

          Github user rmetzger commented on the issue:

          https://github.com/apache/flink/pull/3042

          Very nice change!

          I would love to merge it to 1.2 as well, it's so helpful!
          One very minor thing: I would suggest rounding the percentages shown for the completion. I had an instance where it was showing a progress of 51.000000001%.

          Here's a screenshot from my testing job:
          ![image](https://cloud.githubusercontent.com/assets/89049/21768632/e7800fcc-d67a-11e6-8961-6063f4faa138.png)
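
          The rounding suggested in this comment amounts to the following arithmetic. This is only an illustration in Java with a hypothetical helper name; the web frontend consists of JS/HTML files, so the actual fix would live there.

```java
// Hypothetical helper: rounds a completion percentage to the nearest
// whole number before it is displayed.
final class ProgressRoundingSketch {
    static long roundedPercent(double percent) {
        return Math.round(percent);
    }
}
```

          With this, the reported value of 51.000000001 would be shown as 51.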

          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/3042

          Thanks for checking it out Robert. Would love to merge it for 1.2 as well. I fixed the rounding issue.

          githubbot ASF GitHub Bot added a comment -

          Github user rmetzger commented on the issue:

          https://github.com/apache/flink/pull/3042

          +1 to merge

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3042

          uce Ufuk Celebi added a comment -

          All sub tasks have been implemented.


            People

            • Assignee:
              uce Ufuk Celebi
              Reporter:
              uce Ufuk Celebi