  1. Flink
  2. FLINK-12342

Yarn Resource Manager Acquires Too Many Containers

    Details

    • Release Note:
      With Flink 1.9.0 the Yarn heartbeat configuration parameter has been renamed from `yarn.heartbeat-delay` to `yarn.heartbeat.interval`.
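For anyone carrying an older configuration forward, the rename can be sketched as a small key migration. This is only an illustration: `migrate_heartbeat_key` and the plain-dict representation of the configuration are hypothetical and not part of Flink, and the sketch says nothing about whether Flink itself still accepts the old key.

```python
OLD_KEY = "yarn.heartbeat-delay"      # name used before Flink 1.9.0
NEW_KEY = "yarn.heartbeat.interval"   # name used from Flink 1.9.0 on


def migrate_heartbeat_key(conf: dict) -> dict:
    """Return a copy of conf with the old heartbeat key renamed to the new one."""
    migrated = dict(conf)
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # If the new key is already set explicitly, keep that value.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated
```

The input dict is left untouched, so the helper can be run over a loaded configuration before writing it back.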

      Description

      In the current implementation of YarnFlinkResourceManager, new containers are acquired one at a time as requests arrive from the SlotManager. This works while the job is small, say fewer than 32 containers. If the job needs 256 containers, they cannot be allocated immediately, and the pending requests in AMRMClient are not removed accordingly. We observed AMRMClient asking for the current number of pending requests + 1 (the new request from the SlotManager). As a result, such a job asked for 4000+ containers during startup. If an external dependency misbehaves at the same time, for example slow HDFS access, the whole job is blocked without enough resources and is finally killed by a SlotManager request timeout.

      Thus, we should use the total number of containers requested, rather than the pending requests in AMRMClient, as the threshold for deciding whether to add one more resource request.
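The proposed threshold can be sketched as a small counter that remembers how many containers have been asked for in total. This is one reading of the proposal, not the actual Flink patch: `ContainerRequestTracker`, `ensure_workers`, and the idea of passing in a target worker count are hypothetical names and simplifications, and the model is in Python for brevity.

```python
class ContainerRequestTracker:
    """Decides how many new container requests to send, based on the total asked so far."""

    def __init__(self) -> None:
        # Running total of container requests already sent to YARN.
        self.total_requested = 0

    def ensure_workers(self, target_workers: int) -> int:
        """Return how many additional container requests are needed to reach the target.

        The decision compares the target with the running total, not with
        whatever happens to be pending in the client, so repeating the same
        target never produces duplicate requests.
        """
        missing = max(0, target_workers - self.total_requested)
        self.total_requested += missing
        return missing
```

With this model, a job that needs 256 containers sends 256 requests on the first call to `ensure_workers(256)` and none on later calls with the same target, regardless of how slowly the containers are allocated.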

        Attachments

        1. container.log
          1.05 MB
          Zhenqiu Huang
        2. flink-1.4.png
          211 kB
          Zhenqiu Huang
        3. flink-1.6.png
          232 kB
          Zhenqiu Huang
        4. Screen Shot 2019-04-29 at 12.06.23 AM.png
          211 kB
          Zhenqiu Huang


              People

              • Assignee:
                hpeter Zhenqiu Huang
                Reporter:
                ZhenqiuHuang Zhenqiu Huang
              • Votes:
                0
                Watchers:
                4

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  20m