Thanks for the review, Jian He!
This check should not be needed, because AM should be able to resize an existing container regardless of whether RM has restarted.
I have some concerns about this and would appreciate some clarification. According to the work-preserving RM restart documentation (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html):
RM recovers its running state by taking advantage of the container statuses sent from all NMs. NM will not kill the containers when it re-syncs with the restarted RM. It continues managing the containers and send the container statuses across to RM when it re-registers. RM reconstructs the container instances and the associated applications’ scheduling status by absorbing these containers’ information.
Consider this scenario:
- RM approves a container resource increase request and sends an increase token to AM.
- Before AM actually increases the resource on NM, RM crashes and then restarts. Because of work-preserving recovery, RM reconstructs the container resource from the information sent by NM, which is still the old allocation from before the increase.
- Now AM performs the increase on NM. If NM doesn't reject it, NM will start enforcing the increased resource for the container. RM and NM then hold inconsistent views of the container's resource allocation.
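To make the scenario concrete, here is a toy Java model of the three steps. It is not YARN code: the classes IncreaseToken, NodeManager and ResourceManager, the rmIdentifier field, and the idea that the check compares the token's RM identifier against the one NM currently knows are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the restart scenario; every name here is hypothetical, not YARN API. */
public class IncreaseAfterRestartSketch {

    /** Stand-in for an increase token: target size plus the id of the RM instance that issued it. */
    static final class IncreaseToken {
        final String containerId;
        final int memoryMb;
        final long rmIdentifier;

        IncreaseToken(String containerId, int memoryMb, long rmIdentifier) {
            this.containerId = containerId;
            this.memoryMb = memoryMb;
            this.rmIdentifier = rmIdentifier;
        }
    }

    static final class NodeManager {
        final Map<String, Integer> containers = new HashMap<>();
        final boolean rejectStaleTokens;
        long knownRmIdentifier;

        NodeManager(boolean rejectStaleTokens) {
            this.rejectStaleTokens = rejectStaleTokens;
        }

        /** Applies the increase unless the (assumed) check rejects a token from an older RM instance. */
        boolean increase(IncreaseToken token) {
            if (rejectStaleTokens && token.rmIdentifier != knownRmIdentifier) {
                return false;
            }
            containers.put(token.containerId, token.memoryMb);
            return true;
        }
    }

    static final class ResourceManager {
        final long identifier;
        final Map<String, Integer> view = new HashMap<>();

        ResourceManager(long identifier) {
            this.identifier = identifier;
        }

        IncreaseToken approveIncrease(String containerId, int memoryMb) {
            view.put(containerId, memoryMb);
            return new IncreaseToken(containerId, memoryMb, identifier);
        }

        /** Work-preserving recovery: rebuild the view purely from what NM reports. */
        void recoverFrom(NodeManager nm) {
            view.clear();
            view.putAll(nm.containers);
            nm.knownRmIdentifier = identifier;
        }
    }

    /** Runs the scenario and reports whether RM and NM agree on the container size at the end. */
    static boolean viewsConsistentAfterScenario(boolean nmRejectsStaleTokens) {
        NodeManager nm = new NodeManager(nmRejectsStaleTokens);
        ResourceManager rm1 = new ResourceManager(1L);
        nm.knownRmIdentifier = rm1.identifier;
        nm.containers.put("c1", 1024);
        rm1.view.put("c1", 1024);

        // Step 1: RM approves the increase and hands a token to AM.
        IncreaseToken token = rm1.approveIncrease("c1", 2048);

        // Step 2: RM restarts before AM uses the token; the new RM recovers the old size from NM.
        ResourceManager rm2 = new ResourceManager(2L);
        rm2.recoverFrom(nm);

        // Step 3: AM presents the old token to NM.
        nm.increase(token);

        return rm2.view.get("c1").equals(nm.containers.get("c1"));
    }

    public static void main(String[] args) {
        System.out.println("NM accepts stale token, views consistent: "
                + viewsConsistentAfterScenario(false));
        System.out.println("NM rejects stale token, views consistent: "
                + viewsConsistentAfterScenario(true));
    }
}
```

In this model the views diverge (RM sees 1024 MB, NM enforces 2048 MB) when NM accepts the token, and stay consistent when NM rejects it. Whether rejecting is the right behaviour in real YARN, as opposed to letting the restarted RM learn the new size some other way, is exactly the question I'd like clarified.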
A lot of code is duplicated between authorizeStartRequest and authorizeResourceIncreaseRequest - could you refactor it so that the two share the same code?
A portion of the code belongs to YARN-1644, and the patch won't compile.
This is the same situation as with YARN-1449. Everything is intertwined. We may need to combine everything into one big patch to submit for the Jenkins build.