  1. Flink
  2. FLINK-15031

Automatically calculate required network memory for fine-grained jobs

Details

    Description

      When resources are specified, we expect each operator to declare the resources it requires before using them. As long as no operator uses more than it declared, no resource-related error should occur. This ensures that a deployed task does not fail because of insufficient resources in the TM. Such failures are unnecessary and can even leave a job hanging forever, failing repeatedly while deploying tasks to a TM with insufficient resources.

      Shuffle memory is currently the last missing piece for this goal. Tasks need a minimum number of network buffers to run. Currently, a task can be deployed to a TM with insufficient network buffers and then fail on launch.

      To avoid that, we should calculate the required network memory for a task/SlotSharingGroup before allocating a slot for it.

      The required shuffle memory can be derived from the number of required network buffers. The number of buffers required by a task (ExecutionVertex) is:

      exclusive buffers for input channels (i.e. numInputChannel * buffersPerChannel) + required buffers for the result partition buffer pool (currently numberOfSubpartitions + 1)
      

      Note that this is for the NettyShuffleService case. For custom shuffle services, currently there is no way to get the required shuffle memory of a task.
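
      The buffer count above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not Flink code: the class and method names are hypothetical, and the values used for buffersPerChannel (2) and the buffer size (32 KB) are assumptions made for the example, so the actual configuration of the cluster should be used instead.

```java
// Hypothetical helper, not part of Flink: illustrates the formula from the description.
final class RequiredNetworkMemory {

    // Assumed example values; real values come from the cluster configuration.
    static final int ASSUMED_BUFFERS_PER_CHANNEL = 2;
    static final int ASSUMED_BUFFER_SIZE_BYTES = 32 * 1024;

    /** Buffers required by one task (ExecutionVertex) in the NettyShuffleService case. */
    static int requiredBuffers(int numInputChannels, int buffersPerChannel, int numberOfSubpartitions) {
        // Exclusive buffers for input channels.
        int exclusiveInputBuffers = numInputChannels * buffersPerChannel;
        // Required buffers for the result partition buffer pool.
        int resultPartitionBuffers = numberOfSubpartitions + 1;
        return exclusiveInputBuffers + resultPartitionBuffers;
    }

    /** Shuffle memory in bytes, derived from the number of required buffers. */
    static long requiredBytes(int requiredBuffers, int bufferSizeBytes) {
        return (long) requiredBuffers * bufferSizeBytes;
    }

    public static void main(String[] args) {
        // A task with 4 input channels and 3 subpartitions.
        int buffers = requiredBuffers(4, ASSUMED_BUFFERS_PER_CHANNEL, 3);
        long bytes = requiredBytes(buffers, ASSUMED_BUFFER_SIZE_BYTES);
        System.out.println(buffers + " buffers, " + bytes + " bytes");
    }
}
```

      With these assumed values, a task with 4 input channels and 3 subpartitions needs 4 * 2 + (3 + 1) = 12 buffers.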

      To keep things simple under dynamic slot sharing, the required shuffle memory for a task should be the maximum required shuffle memory across all ExecutionVertex instances of the same ExecutionJobVertex. The required shuffle memory for a slot sharing group should then be the sum of the shuffle memory of each ExecutionJobVertex in the group.
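
      The aggregation rule can be sketched as follows. This is a minimal illustration under assumptions: the class and method names are hypothetical, and each ExecutionJobVertex is modelled simply as a list of per-ExecutionVertex byte requirements rather than with Flink's real types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: max per job vertex, sum per slot sharing group.
final class SlotSharingGroupShuffleMemory {

    /** Max required shuffle memory over all ExecutionVertex of one ExecutionJobVertex. */
    static long perJobVertex(List<Long> executionVertexBytes) {
        return executionVertexBytes.stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L); // a job vertex with no subtasks needs nothing
    }

    /** Sum of the per-ExecutionJobVertex requirements within one slot sharing group. */
    static long perSlotSharingGroup(List<List<Long>> jobVertices) {
        return jobVertices.stream()
                .mapToLong(SlotSharingGroupShuffleMemory::perJobVertex)
                .sum();
    }

    public static void main(String[] args) {
        // Two job vertices in one group; the values are made-up byte counts.
        List<List<Long>> group = Arrays.asList(
                Arrays.asList(100L, 300L, 200L), // max is 300
                Arrays.asList(50L, 50L));        // max is 50
        System.out.println(perSlotSharingGroup(group)); // prints 350
    }
}
```

      Taking the maximum per ExecutionJobVertex means the result does not depend on which subtask ends up in a given slot, at the cost of possibly reserving more memory than a particular subtask needs.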

            People

              jinxing6042@126.com Jin Xing
              zhuzh Zhu Zhu

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m