We noticed negative values in the RM UI counters on our cluster:
- Containers Running: -19
- Memory Used: -38GB
- Vcores Used: -19
After checking the RM logs, we found the following event counts:
- Assigned container: 67019 times
- Released container: 67019 times
- Invalid container released: 19 times
Some related log records can be found in the "Example.log-cut" attachment.
After some investigation, we concluded that there is a race condition affecting a container that was scheduled to be killed but completed successfully before the kill took effect.
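The suspected sequence can be sketched with a toy model. This is not the actual ResourceManager code: the class names (DoubleReleaseSketch, Scheduler, Metrics), the guardReleases flag, and the per-container size of 2 GB and 1 vcore (inferred from -38GB / -19) are all assumptions made for illustration. The sketch only shows how a second release for an already completed container could push the counters below zero by exactly the number of invalid releases, if the counters are decremented without checking that the container is still tracked.

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleReleaseSketch {

    /** Hypothetical stand-in for the counters shown in the RM UI. */
    static class Metrics {
        int containersRunning;
        int memoryUsedGb;
        int vcoresUsed;
    }

    /** Hypothetical, heavily simplified scheduler; not the real YARN scheduler. */
    static class Scheduler {
        final Metrics metrics = new Metrics();
        final Set<Integer> liveContainers = new HashSet<>();
        final boolean guardReleases;
        int invalidReleases;

        Scheduler(boolean guardReleases) {
            this.guardReleases = guardReleases;
        }

        void assign(int id, int memGb, int vcores) {
            liveContainers.add(id);
            metrics.containersRunning++;
            metrics.memoryUsedGb += memGb;
            metrics.vcoresUsed += vcores;
        }

        void release(int id, int memGb, int vcores) {
            // remove() returns false when the container is no longer tracked,
            // e.g. it already completed before the kill request arrived.
            boolean known = liveContainers.remove(id);
            if (!known) {
                invalidReleases++;
                if (guardReleases) {
                    return; // skip the decrement for an unknown container
                }
            }
            metrics.containersRunning--;
            metrics.memoryUsedGb -= memGb;
            metrics.vcoresUsed -= vcores;
        }
    }

    /** Assigns 100 containers, completes all of them, then "kills" 19 again. */
    static Scheduler simulate(boolean guardReleases) {
        Scheduler s = new Scheduler(guardReleases);
        for (int id = 0; id < 100; id++) {
            s.assign(id, 2, 1);   // assumed 2 GB and 1 vcore per container
        }
        for (int id = 0; id < 100; id++) {
            s.release(id, 2, 1);  // normal completion
        }
        for (int id = 0; id < 19; id++) {
            s.release(id, 2, 1);  // late kill of an already completed container
        }
        return s;
    }

    public static void main(String[] args) {
        for (boolean guard : new boolean[] {false, true}) {
            Scheduler s = simulate(guard);
            System.out.println("guard=" + guard
                    + " running=" + s.metrics.containersRunning
                    + " memGb=" + s.metrics.memoryUsedGb
                    + " vcores=" + s.metrics.vcoresUsed
                    + " invalid=" + s.invalidReleases);
        }
    }
}
```

Without the guard, the model ends at -19 containers, -38 GB, and -19 vcores with 19 invalid releases, which matches the shape of the numbers reported above. With the guard, the counters return to zero while the invalid releases are still counted. Whether the real code path behaves this way is unverified.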
There is also a patch that may mitigate the effects of the issue, but it does not solve the underlying problem (see mitigating2.5.1diff).
Unfortunately, the cluster and all other logs have been lost, because this report was written about a year ago but was never submitted properly. We also don't know whether the issue exists in other versions.