YARN-6212. However, I think it is not a duplicate of YARN-3933.
The primary cause of negative values is that metrics do not recover properly when NM restart.
AllocatedContainers,ContainersLaunched,AllocatedGB,AvailableGB,AllocatedVCores,AvailableVCores in metrics also need to recover when NM restart.
This should be done in ContainerManagerImpl#recoverContainer.
The scenario could be reproduction by the following steps:
- Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true,YarnConfiguration.NM_RECOVERY_SUPERVISED=true in NM
- Submit an application and keep running
- Restart NM
- Stop the application
- Now you get the negative values