Wangda Tan I totally agree that Yarn should not mess with Java Xmx. Sorry for not being clear before.
While digging into the design details of this issue, there is (I think) a piece that is missing from the original design doc, which I hope to get some insights/clarifications from the community:
It seems there is no discussion about the container resource increase token expiration logic.
Here is what I think that should happen:
1. AM sends a container resource increase request to RM.
2. RM grants the request, allocating the additional resource to the container, updating its internal resource bookkeeping.
3. During the next AM-RM heartbeat, RM pulls the newly increased container, creates a token for it and sets the token in the allocation response. In the meantime, RM starts a timer for this granted increase (e.g., register with ContainerResourceIncreaseExpirer).
4. AM acquires the container resource increase token from the heartbeat response, then calls the NMClient API to launch the container resource increase action on NM.
5. NM receives the request, increases the monitoring quota of the container, and notifies the NodeStatusUpdater.
6. The NodeStatusUpdater informs the increase success to RM during regular NM-RM heartbeat.
7. Upon receiving the increase success message, the RM stops the timer (e.g, unregister with ContainerResourceIncreaseExpirer).
If, however, the timer in RM expires, and no increase success message is received for this container, RM must release the increased resource to the container, and update its internal resource bookkeeing.
As such, NM-RM heartbeat must also include container resource increase message (which doesn't exist in the original design), otherwise the expiration logic will not work.
In addition, RM must remember the original resource allocation to the container (this info may be stored in the ContainerResourceIncreaseExpirer), because in the case of expiration, RM needs to release the increased resource and revert back to the original resource allocation. This is different from a newly allocated container, in which case, RM simply needs to release the resource for the entire container when it expires.
To make matters more complicated, after a container resource increase token has been given out, and before it expires, there is no guarantee that AM won't issue a resource decrease action on the same container. Because the resource decrease action starts from NM, NM has no idea that a resource increase token on the same container has been issued, and that a resource increase action could happen anytime.
Given the above, here is what I propose to simplify things as much as we can without compromising the main functionality:
At the RM side
1. During each scheduling, if RM finds that there are still granted container resource increase sitting in RM (i.e., not yet acquired by AM), it will skip scheduling any outstanding resource increase request to the same container.
2. During each scheduling, if RM finds that there is a granted container resource increase registered with ContainerResourceIncreaseExpirer, it will skip scheduling any outstanding resource increase request to the same container.
This will guarantee that at any time, there can be one and only one resource increase request for a container.
At the NM side
1. Create a map to track any resource increase or decrease action for a container in NMContext. At any time, there can only be either an increase action or a decrease action going on for a specific container. While an increase/decrease action is in progress in NM, any new request from AM to increase/decrease resource to the same container will be rejected (with proper error messages).
With the above logic, here is an example of what could happen:
1. A container is currently using 6G
2. AM asks RM to increase it to 8G
3. RM grants the increase request, allocates the resource to the container to 8G, and issues a token to AM. It starts a timer and remembers the original resource allocation before the increase as 6G.
4. AM, instead of initiating the resource increase to NM, requests a resource decrease to NM to decrease it to 4G
5. The decrease is successful and RM gets the notification, and updates the container resource to 4G
After this, two possible sequences may occur:
6. Before the token expires, the AM requests the resource increase to NM
7 The increase is successful and RM gets the notification, and updates the container resource back to 8G
6. AM never sends the resource increase to NM
7. The token expires in RM. RM attempts to revert the resource increase (i.e., set the resource allocation back to 6G), but seeing that it is currently using 4G, it will do nothing.
This is what I have for now. Sorry for the long email, and I hope that I have made myself clear. Please let me know if I am on the right track. Any comments/corrections/suggestions/advice will be extremely appreciated.