[FLINK-10868] Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.2, 1.7.0
Fix Version/s: 1.13.0
Component/s: Deployment / Mesos, Deployment / YARN
Labels:
- pull-request-available

Description

Currently, YarnResourceManager does not use yarn.maximum-failed-containers as a limit on resource acquisition. In the worst case, when newly started containers consistently fail, YarnResourceManager enters an endless resource acquisition loop without ever failing the job. Combined with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.
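A minimal sketch of the kind of limit described above, for illustration only. The class FailedContainerLimiter and its methods are hypothetical and are not taken from Flink's code base or from the linked pull request. The sketch also assumes that a negative limit disables the check. It counts failed container starts and throws once the count exceeds the limit, so that the caller can fail the job instead of requesting another container.

```java
/**
 * Hypothetical sketch: tracks failed container starts against a fixed limit.
 * Not Flink's actual ResourceManager code; all names here are illustrative.
 */
class FailedContainerLimiter {

    // Assumed semantics: a negative value disables the limit.
    private final int maxFailedContainers;

    // Number of container start failures recorded so far.
    private int failedContainers = 0;

    FailedContainerLimiter(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one failed container start.
     * Throws once the number of failures exceeds the limit, so the caller
     * can fail the job instead of requesting yet another container.
     */
    void onContainerFailed() {
        failedContainers++;
        if (maxFailedContainers >= 0 && failedContainers > maxFailedContainers) {
            throw new IllegalStateException(
                    "Failed containers (" + failedContainers
                            + ") exceeded the limit (" + maxFailedContainers + ")");
        }
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

In this sketch, a resource manager would call onContainerFailed() from its container-failure callback and treat the exception as a fatal error for the job. Without such a check, every failure simply triggers a new container request, which is the unbounded behaviour this issue reports.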

Issue Links

Blocked:
- FLINK-30095 Flink's JobCluster ResourceManager should throw an exception when the failure number of starting worker reaches the maximum failure rate (Open)

causes:
- FLINK-21139 ThresholdMeterTest.testMarkMultipleEvents unstable (Closed)

fixes:
- FLINK-17127 Make pod creating retry interval configurable (Closed)

links to:
- GitHub Pull Request #8952

People

Assignee: Zhenqiu Huang
Reporter: Zhenqiu Huang
Votes: 0
Watchers: 7

Dates

Created: 13/Nov/18 17:15
Updated: 19/Nov/22 09:48
Resolved: 13/Jan/21 01:09

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h