[FLINK-9190] YarnResourceManager sometimes does not request new Containers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 1.5.0
Component/s: Deployment / YARN, Runtime / Coordination
Labels:
- flip-6
- pull-request-available
Environment:

Hadoop 2.8.3
ZooKeeper 3.4.5
Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8

Description
The YarnResourceManager does not request new containers if TaskManagers are killed in rapid succession. After 5 minutes, the job is restarted due to a NoResourceAvailableException and runs normally afterwards. I suspect that a TaskManager failure goes unnoticed if it occurs before the TaskManager has registered with the master. Logs are attached; they include additional log statements that I added to YarnResourceManager.onContainersCompleted and YarnResourceManager.onContainersAllocated.

Expected Behavior
The YarnResourceManager should recognize that the container is completed and keep requesting new containers. The job should run as soon as resources are available.
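
The expected behavior can be illustrated with a minimal bookkeeping sketch. This is not Flink's YarnResourceManager and not the fix from the linked pull requests; the class ContainerTracker and all of its methods are hypothetical names chosen for illustration. It assumes the suspicion in the description is correct, namely that a container can complete before its TaskManager has registered, and shows one way to make sure a replacement is requested in that case as well.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical model of container bookkeeping (not Flink code).
 * A container is tracked from allocation onwards, so that a failure
 * before TaskManager registration still triggers a new request.
 */
final class ContainerTracker {
    // Containers that were allocated but whose TaskManager has not registered yet.
    private final Set<String> allocatedNotRegistered = new HashSet<>();
    // Containers whose TaskManager has registered with the master.
    private final Set<String> registered = new HashSet<>();
    // Number of container requests that have not been fulfilled yet.
    private int pendingRequests = 0;

    void requestContainer() {
        pendingRequests++;
    }

    void onContainerAllocated(String containerId) {
        if (pendingRequests > 0) {
            pendingRequests--;
        }
        allocatedNotRegistered.add(containerId);
    }

    void onTaskManagerRegistered(String containerId) {
        // Only promote containers that are known to be allocated.
        if (allocatedNotRegistered.remove(containerId)) {
            registered.add(containerId);
        }
    }

    void onContainerCompleted(String containerId) {
        // Check both sets: the container may have died before registration.
        boolean wasUnregistered = allocatedNotRegistered.remove(containerId);
        boolean wasRegistered = registered.remove(containerId);
        if (wasUnregistered || wasRegistered) {
            requestContainer();
        }
    }

    int getPendingRequests() {
        return pendingRequests;
    }

    int getLiveContainers() {
        return allocatedNotRegistered.size() + registered.size();
    }
}
```

In this sketch, a container that completes right after allocation leaves one pending request behind, the same as a container that completes after its TaskManager registered. Completions for unknown container IDs are ignored so that duplicate notifications do not inflate the number of requests.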

Attachments

yarn-logs
17/Apr/18 12:53
3.04 MB
Gary Yao

Issue Links

links to

GitHub Pull Request #5881

GitHub Pull Request #5931

People

Assignee: Gary Yao

Reporter: Gary Yao

Votes: 0

Watchers: 6

Dates

Created: 17/Apr/18 12:59

Updated: 07/Sep/18 17:16

Resolved: 10/May/18 14:24