[YARN-4698] Negative value in RM UI counters due to double container release - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 2.5.1
Fix Version/s: None
Component/s: fairscheduler, resourcemanager
Labels: None

Description

We noticed negative values in the RM UI counters on our cluster:

Containers Running: -19
Memory Used: -38GB
Vcores Used: -19

After checking the RM logs, we found that the following events had occurred:

Assigned container: 67019 times
Released container: 67019 times
Invalid container released: 19 times

Some related log records can be found in the "Example.log-cut" attachment.

After some investigation, we concluded that there is some kind of race condition affecting a container that was scheduled for killing but completed successfully before the kill took effect.
There is also a patch that may mitigate the effects of the issue but does not solve the underlying problem (see mitigating2.5.1diff).
Unfortunately, the cluster and all other logs are lost, because the report was written about a year ago but was not submitted properly. We also do not know whether the issue exists in other versions.
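
To illustrate the suspected mechanism, here is a minimal standalone sketch. It is not YARN code: the class DoubleReleaseSketch, the Tracker type and its methods are hypothetical names made up for this example. It assumes every affected container held 2 GB and 1 vcore (an assumption that happens to match -38GB / -19 above) and that a second release of the same container decrements the counters again. The guarded variant shows one possible way to make release idempotent; it is not necessarily what the attached patch does.

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleReleaseSketch {

    /** Hypothetical tracker, loosely modelled on the counters the RM UI shows. */
    static final class Tracker {
        private final boolean guarded;                  // true: ignore releases of unknown containers
        private final Set<String> live = new HashSet<>();
        int containersRunning;
        int memoryUsedGb;
        int vcoresUsed;
        int invalidReleases;

        Tracker(boolean guarded) {
            this.guarded = guarded;
        }

        synchronized void assign(String id, int memGb, int vcores) {
            live.add(id);
            containersRunning++;
            memoryUsedGb += memGb;
            vcoresUsed += vcores;
        }

        synchronized void release(String id, int memGb, int vcores) {
            boolean wasLive = live.remove(id);
            if (!wasLive) {
                invalidReleases++;                      // counted like "Invalid container released"
                if (guarded) {
                    return;                             // idempotent release: leave counters untouched
                }
            }
            containersRunning--;
            memoryUsedGb -= memGb;
            vcoresUsed -= vcores;
        }
    }

    /** Assigns n containers and releases each twice: once on completion, once on the kill path. */
    static Tracker simulate(boolean guarded, int n) {
        Tracker t = new Tracker(guarded);
        for (int i = 0; i < n; i++) {
            String id = "container_" + i;               // made-up identifier
            t.assign(id, 2, 1);                         // assumed size: 2 GB, 1 vcore
            t.release(id, 2, 1);                        // container completed successfully
            t.release(id, 2, 1);                        // late kill releases it again
        }
        return t;
    }

    public static void main(String[] args) {
        for (boolean guarded : new boolean[] {false, true}) {
            Tracker t = simulate(guarded, 19);
            System.out.println("guarded=" + guarded
                    + " running=" + t.containersRunning
                    + " memGb=" + t.memoryUsedGb
                    + " vcores=" + t.vcoresUsed
                    + " invalid=" + t.invalidReleases);
        }
    }
}
```

Under these assumptions, 19 double releases leave the unguarded tracker at -19 containers, -38 GB and -19 vcores, while the guarded tracker stays at zero and still records 19 invalid releases.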