[FLINK-12342] Yarn Resource Manager Acquires Too Many Containers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.4, 1.7.2, 1.8.0
Fix Version/s: 1.8.3, 1.9.2, 1.10.0
Component/s: Deployment / YARN
Labels:
- pull-request-available
Environment:

We run jobs on Flink release 1.6.3.

Release Note:
With Flink 1.9.0 the Yarn heartbeat configuration parameter has been renamed from `yarn.heartbeat-delay` to `yarn.heartbeat.interval`.

Description

In the current implementation of YarnFlinkResourceManager, new containers are requested one at a time whenever a request arrives from the SlotManager. This mechanism works when the job is small, say fewer than 32 containers. If the job needs 256 containers, they cannot be allocated immediately, and the pending requests in AMRMClient are not removed accordingly. We observed AMRMClient asking for the current number of pending requests + 1 (the new request from the SlotManager) containers. As a result, during the startup of such a job, it asked for more than 4000 containers. If an external dependency issue occurs at the same time, for example slow HDFS access, the whole job is blocked without getting enough resources and is finally killed by a SlotManager request timeout.

Thus, we should use the total number of containers requested, rather than the number of pending requests in AMRMClient, as the threshold when deciding whether to add one more resource request.
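The difference between the two thresholds can be sketched with a small simulation. This is a simplified model under stated assumptions, not Flink's actual code: the class name, method names, and the `removalLag` parameter are hypothetical. It assumes one new SlotManager request per heartbeat and that a satisfied request stays pending for `removalLag` further heartbeats before it is removed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: none of these names exist in Flink or the YARN client.
final class ContainerRequestSketch {

    // Pending-based policy: each heartbeat asks for every request still pending,
    // including the one just added ("pending + 1" in the description).
    static int totalAskedPendingBased(int slotRequests, int removalLag) {
        Deque<Integer> pendingAddedAt = new ArrayDeque<>();
        int totalAsked = 0;
        for (int heartbeat = 1; heartbeat <= slotRequests; heartbeat++) {
            pendingAddedAt.addLast(heartbeat);      // new request from the slot manager
            totalAsked += pendingAddedAt.size();    // stale pending requests are asked again
            // Assumption: a request is only removed once its lag has elapsed.
            while (!pendingAddedAt.isEmpty()
                    && pendingAddedAt.peekFirst() + removalLag <= heartbeat) {
                pendingAddedAt.removeFirst();
            }
        }
        return totalAsked;
    }

    // Total-based policy: add a request only while the running total of
    // containers asked for is below the number of containers needed.
    static int totalAskedTotalBased(int slotRequests) {
        int totalAsked = 0;
        for (int heartbeat = 1; heartbeat <= slotRequests; heartbeat++) {
            if (totalAsked < slotRequests) {
                totalAsked++;
            }
        }
        return totalAsked;
    }

    public static void main(String[] args) {
        int needed = 256;
        System.out.println("pending-based, lag 15: " + totalAskedPendingBased(needed, 15));
        System.out.println("total-based:           " + totalAskedTotalBased(needed));
    }
}
```

In this model the pending-based count grows with the removal lag, while the total-based count never exceeds the number of containers the job needs. A real implementation would also have to account for containers that are lost or rejected after being requested, which this sketch ignores.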

Attachments

- container.log (1.05 MB), 29/Apr/19 08:09, Zhenqiu Huang
- flink-1.4.png (211 kB), 29/Apr/19 18:42, Zhenqiu Huang
- flink-1.6.png (232 kB), 29/Apr/19 18:42, Zhenqiu Huang
- Screen Shot 2019-04-29 at 12.06.23 AM.png (211 kB), 29/Apr/19 18:41, Zhenqiu Huang

Issue Links

is caused by

FLINK-13184 Starting a TaskExecutor blocks the YarnResourceManager's main thread

Closed

relates to

FLINK-14582 Do not upload {uuid}-taskmanager-conf.yaml for each task manager container

Closed

links to

GitHub Pull Request #8306

GitHub Pull Request #10089

People

Assignee: Till Rohrmann

Reporter: Zhenqiu Huang

Votes: 0

Watchers: 6

Dates

Created: 26/Apr/19 17:39

Updated: 07/Nov/19 10:31

Resolved: 06/Nov/19 22:53

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

40m