FLINK-14712: Improve back-pressure reporting mechanism

    Details

      Description

      (1) The current backpressure monitor is heavyweight.

      • Backpressure monitoring works by repeatedly taking stack trace samples of the running tasks.

      (2) It is difficult to find out which vertex is the source of backpressure.

      • Users need to know the network metrics of both the current vertex and its upstream vertices to judge whether the current vertex is the source of backpressure. Today, users have to record this information themselves.
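
      To illustrate why point (1) calls sampling heavyweight, here is a minimal Java sketch of stack-trace sampling. It is not Flink's implementation: the class name, the method names, and the `waitForBuffer` stand-in for a task blocked on an output buffer are all hypothetical. Each sample has to capture the full stack of the observed thread, and many samples are needed for a stable ratio.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only; all names here are hypothetical, not Flink classes.
class StackSampleSketch {

    /** Returns the fraction of samples whose stack contains a frame named methodName. */
    static double sampleRatio(Thread task, String methodName, int numSamples, long delayMillis) {
        int hits = 0;
        for (int i = 0; i < numSamples; i++) {
            // Capturing a stack trace is the expensive step of this approach.
            for (StackTraceElement frame : task.getStackTrace()) {
                if (frame.getMethodName().equals(methodName)) {
                    hits++;
                    break;
                }
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("sampling interrupted", e);
            }
        }
        return (double) hits / numSamples;
    }

    /** Hypothetical stand-in for a task that blocks while waiting for an output buffer. */
    static void waitForBuffer(CountDownLatch entered, CountDownLatch release) {
        entered.countDown();
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Starts a task that stays blocked, samples it, then releases it. */
    static double sampleBlockedTask(int numSamples) {
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread task = new Thread(() -> waitForBuffer(entered, release), "demo-task");
        task.setDaemon(true);
        task.start();
        try {
            entered.await(); // make sure the task is inside waitForBuffer before sampling
            return sampleRatio(task, "waitForBuffer", numSamples, 5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted before sampling", e);
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked ratio = " + sampleBlockedTask(20));
    }
}
```

      A metric that each task updates itself avoids this repeated stack capture, which is the motivation for the changes proposed below.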

      Proposed Changes

      1. Expose the new mechanism implemented in FLINK-14472 as an "is back-pressured" metric.

      2. Show the vertex that is the source of backpressure for the job.

      3. Expose network metrics in IOMetricsInfo:

      • SubTask
        • pool usage: outPoolUsage, inputExclusiveBuffersUsage, inputFloatingBuffersUsage.
          • These show whether a subtask that is not back-pressured itself is causing backpressure (full input, empty output).
          • Comparing exclusive and floating buffer usage shows whether all channels are back-pressured or only some of them.
        • back-pressured: shows whether the subtask is back-pressured.
      • Vertex
        • pool usage: outPoolUsageAvg, inputExclusiveBuffersUsageAvg, inputFloatingBuffersUsageAvg.
        • back-pressured: shows whether the vertex is back-pressured (merged over all of its subtasks).
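
      The metrics listed above could be interpreted roughly as in the following Java sketch. It is not the proposed IOMetricsInfo implementation. The class names, the 0.9/0.1 thresholds, the "exclusive buffers full means all channels" rule, and the choice to mark a vertex as back-pressured when any subtask is back-pressured are all assumptions made for illustration; the issue does not specify them.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; class names and thresholds are assumptions.
class BackPressureMetricsSketch {

    /** Hypothetical per-subtask view of the proposed metrics (usage values in [0, 1]). */
    static final class SubtaskMetrics {
        final double outPoolUsage;
        final double inputExclusiveBuffersUsage;
        final double inputFloatingBuffersUsage;
        final boolean backPressured;

        SubtaskMetrics(double out, double exclusive, double floating, boolean backPressured) {
            this.outPoolUsage = out;
            this.inputExclusiveBuffersUsage = exclusive;
            this.inputFloatingBuffersUsage = floating;
            this.backPressured = backPressured;
        }
    }

    /** Hypothetical per-vertex view: averaged usage plus a merged back-pressured flag. */
    static final class VertexMetrics {
        final double outPoolUsageAvg;
        final double inputExclusiveBuffersUsageAvg;
        final double inputFloatingBuffersUsageAvg;
        final boolean backPressured;

        VertexMetrics(double out, double exclusive, double floating, boolean backPressured) {
            this.outPoolUsageAvg = out;
            this.inputExclusiveBuffersUsageAvg = exclusive;
            this.inputFloatingBuffersUsageAvg = floating;
            this.backPressured = backPressured;
        }
    }

    static final double HIGH = 0.9; // assumed threshold for "full"
    static final double LOW = 0.1;  // assumed threshold for "empty"

    /** Not back-pressured itself, but input is full and output is empty. */
    static boolean isLikelySource(SubtaskMetrics m) {
        boolean inputFull =
                Math.max(m.inputExclusiveBuffersUsage, m.inputFloatingBuffersUsage) >= HIGH;
        return !m.backPressured && inputFull && m.outPoolUsage <= LOW;
    }

    /** Assumed heuristic: full exclusive buffers -> "all"; only floating full -> "some". */
    static String affectedChannels(SubtaskMetrics m) {
        if (m.inputExclusiveBuffersUsage >= HIGH) {
            return "all";
        }
        return m.inputFloatingBuffersUsage >= HIGH ? "some" : "none";
    }

    /** Averages pool usage over subtasks; back-pressured if any subtask is (assumption). */
    static VertexMetrics merge(List<SubtaskMetrics> subtasks) {
        if (subtasks.isEmpty()) {
            throw new IllegalArgumentException("a vertex needs at least one subtask");
        }
        double out = 0, exclusive = 0, floating = 0;
        boolean backPressured = false;
        for (SubtaskMetrics m : subtasks) {
            out += m.outPoolUsage;
            exclusive += m.inputExclusiveBuffersUsage;
            floating += m.inputFloatingBuffersUsage;
            backPressured |= m.backPressured;
        }
        int n = subtasks.size();
        return new VertexMetrics(out / n, exclusive / n, floating / n, backPressured);
    }

    public static void main(String[] args) {
        SubtaskMetrics slow = new SubtaskMetrics(0.0, 1.0, 1.0, false);
        SubtaskMetrics upstream = new SubtaskMetrics(1.0, 0.2, 0.4, true);
        System.out.println("slow is likely source: " + isLikelySource(slow));
        System.out.println("upstream is likely source: " + isLikelySource(upstream));
        System.out.println("vertex back-pressured: "
                + merge(Arrays.asList(slow, upstream)).backPressured);
    }
}
```

      Whether the vertex-level flag should mean "any subtask" or "all subtasks" is a design choice; the "any" variant used here makes a single skewed subtask visible at the vertex level.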

              People

              • Assignee: lining
              • Reporter: lining
              • Votes: 0
              • Watchers: 6

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 50m