Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 1.6.2, 1.7.0
Description
Currently, YarnResourceManager does not use yarn.maximum-failed-containers as a limit on resource acquisition. In the worst case, when newly started containers consistently fail, YarnResourceManager enters an infinite resource acquisition loop without ever failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, this quickly occupies all resources of the YARN queue.
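To illustrate the expected behavior, here is a minimal sketch of a failure counter that stops container reallocation once a configured maximum is exceeded. It is not Flink's actual implementation: the class FailedContainerLimiter and its methods are hypothetical names, and the assumption that the job should fail once the failure count goes above the configured value is one reading of yarn.maximum-failed-containers.

```java
/**
 * Hypothetical helper, not part of Flink's API. It counts failed container
 * starts and reports whether another container may still be requested.
 */
final class FailedContainerLimiter {

    // Assumed to hold the value configured via yarn.maximum-failed-containers.
    private final int maxFailedContainers;

    // Number of container failures observed so far.
    private int failedContainers;

    FailedContainerLimiter(int maxFailedContainers) {
        if (maxFailedContainers < 0) {
            throw new IllegalArgumentException("maxFailedContainers must be >= 0");
        }
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one failed container start.
     *
     * @return true if the failure count is still within the limit and a
     *         replacement container may be requested; false if the limit
     *         is exceeded and the caller should fail the job instead
     */
    boolean recordFailureAndCheckRetry() {
        failedContainers++;
        return failedContainers <= maxFailedContainers;
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

Under these assumptions, a resource manager would call recordFailureAndCheckRetry() each time a container fails to start, and fail the job rather than request a replacement when it returns false. That bounds the acquisition loop described above.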
Issue Links
- Blocked
  - FLINK-30095 Flink's JobCluster ResourceManager should throw an exception when the failure number of starting worker reaches the maximum failure rate (Open)
- causes
  - FLINK-21139 ThresholdMeterTest.testMarkMultipleEvents unstable (Closed)
- fixes
  - FLINK-17127 Make pod creating retry interval configurable (Closed)
- links to