Flink / FLINK-18662

Provide more detailed metrics on why an unaligned checkpoint is taking a long time


Details

      This modifies the {{checkpointAlignmentTime}} metric for at-least-once checkpointing and unaligned exactly-once checkpointing. Previously it was always 0; now it is defined as the duration between processing the first and the last checkpoint barrier.

    Description

      With unaligned checkpoints, a situation like the one in the attached screenshot can occur.

      The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync time, ~2h50min async time, and ~0s start delay. This means the task received the first checkpoint barrier from one of its channels very quickly (~0s) and the sync part was quick, but we do not know why the async part took so long. There are three possible causes:

      1. long operator state IO writes
      2. long spilling of in-flight data
      3. long time to receive the final checkpoint barrier from the last lagging channel

      The first and second are probably indistinguishable, and the difference between them does not matter much for analysis. The last one, however, is quite different: it may be independent of IO, and we are currently missing this information.

      Maybe we could report it as "alignment duration", and while we are at it, we could also report the amount of spilled in-flight data for unaligned checkpoints as "alignment buffered"?
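
To illustrate the proposal, here is a minimal sketch of how the two values could be tracked for a single checkpoint. `UnalignedCheckpointStats` and its methods are hypothetical names chosen for illustration and are not Flink APIs; the sketch only assumes that the task can observe when a barrier arrives on each input channel and how many bytes of in-flight data it persists.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class UnalignedCheckpointStats:
    """Hypothetical per-checkpoint tracker; names are illustrative, not Flink APIs."""

    num_channels: int
    first_barrier_nanos: Optional[int] = None
    last_barrier_nanos: Optional[int] = None
    spilled_bytes: int = 0
    _seen_channels: Set[int] = field(default_factory=set, init=False, repr=False)

    def on_barrier(self, channel: int, now_nanos: int) -> None:
        # Ignore repeated barriers from a channel that has already reported.
        if channel in self._seen_channels:
            return
        self._seen_channels.add(channel)
        # The first barrier starts the measurement; each later one extends it.
        if self.first_barrier_nanos is None:
            self.first_barrier_nanos = now_nanos
        self.last_barrier_nanos = now_nanos

    def on_inflight_data_spilled(self, num_bytes: int) -> None:
        # "Alignment buffered": bytes of in-flight data persisted for this checkpoint.
        self.spilled_bytes += num_bytes

    @property
    def is_complete(self) -> bool:
        # True once every input channel has delivered its barrier.
        return len(self._seen_channels) == self.num_channels

    @property
    def alignment_duration_nanos(self) -> int:
        # "Alignment duration": time between the first and the last barrier.
        if self.first_barrier_nanos is None or self.last_barrier_nanos is None:
            return 0
        return self.last_barrier_nanos - self.first_barrier_nanos
```

With such a tracker, a long async phase paired with a long alignment duration would point to the third cause (a lagging channel), while a short alignment duration would point to IO.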

      Ideally we should report these as new metrics, but that leaves the question of how to display them in the UI, given the limited space available. Maybe they could be reported as:

      Alignment Buffered    Alignment Duration
      0 B (632 MB)          0ms (2h 49m 32s)

      The values in parentheses would come from unaligned checkpoints.
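
The combined cell format above can be sketched as follows. This is only an illustration of the suggested layout, not the Flink web UI's actual formatting code; the helper names and the choice of 1024-based units are assumptions.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with a 1024-based unit (assumed convention)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    index = 0
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    # Whole numbers are printed without a fractional part.
    number = f"{value:.0f}" if value == int(value) else f"{value:.1f}"
    return f"{number} {units[index]}"


def format_duration(millis: int) -> str:
    """Render a duration as '0ms' or as hour/minute/second parts."""
    if millis < 1000:
        return f"{millis}ms"
    hours, rest = divmod(millis // 1000, 3600)
    minutes, seconds = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if seconds:
        parts.append(f"{seconds}s")
    return " ".join(parts)


def combined_cell(aligned: str, unaligned: str) -> str:
    """Aligned value first, unaligned value in parentheses."""
    return f"{aligned} ({unaligned})"


buffered = combined_cell(format_bytes(0), format_bytes(632 * 1024**2))
duration = combined_cell(format_duration(0), format_duration(10_172_000))
print(buffered)
print(duration)
```

Packing both values into one cell keeps the column count unchanged, at the cost of a denser cell that readers must learn to parse.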


          People

            pnowojski Piotr Nowojski