Nice discussion, Devaraj K!
If there are some long-running containers in the NM and the RMAdmin CLI gets terminated before issuing the forceful decommission, then the NM could remain in the “DECOMMISSIONING” state irrespective of the timeout. Am I missing anything?
If users terminate the blocking/pending CLI, it only means that they want to track the timeout themselves, or that they want to move the timeout earlier or later. In this case, the decommissioning nodes either get decommissioned when their applications finish (a clean quit), or wait for the user to decommission them again later. We can add alert messages later for nodes that stay in the decommissioning stage for a really long time. The basic idea is that we agree not to track the timeout on the RM side for each individual node.
If we don't pass the timeout to the RM, then how are we going to achieve this? Do you mean this will be handled later, once the basic things are done?
You are right that the timeout value could be useful to pass down to the AM for preempting containers (although it has no effect on terminating nodes). Let's keep it here, and we can leverage it later when we notify the AM.
For making the timeout longer, if we use a new CLI then there is a chance of the forceful decommission happening with the old CLI's timeout. Is there any constraint that this needs to be done with the same CLI?
I don't quite understand the case described here. Users should terminate the current CLI and launch a new CLI with the adjusted timeout value if they want to wait for a shorter or longer time. If the previous timeout has already passed, the current CLI should already have quit with all nodes decommissioned. Am I missing something here?
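
To make the proposed behaviour concrete, here is a minimal simulation of client-side timeout tracking. It is only a sketch under assumptions: `FakeRM`, `Node`, `cli_decommission`, the tick-based clock, and the `interrupted_after` parameter are all hypothetical names invented for illustration and are not part of the YARN code base or the RMAdmin CLI. The point it shows is that the RM keeps no per-node timeout, so a terminated CLI leaves nodes in DECOMMISSIONING until their containers finish or a new CLI is launched.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_containers: int
    state: str = "RUNNING"


class FakeRM:
    """Hypothetical stand-in for the RM. It stores no timeout per node."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def graceful_decommission(self, names):
        # Idempotent: only RUNNING nodes move to DECOMMISSIONING.
        for name in names:
            if self.nodes[name].state == "RUNNING":
                self.nodes[name].state = "DECOMMISSIONING"
        self._settle()

    def force_decommission(self, names):
        for name in names:
            self.nodes[name].state = "DECOMMISSIONED"

    def tick(self):
        # One simulated time unit: every busy node finishes one container.
        for node in self.nodes.values():
            if node.running_containers > 0:
                node.running_containers -= 1
        self._settle()

    def _settle(self):
        # A decommissioning node with no containers left quits cleanly.
        for node in self.nodes.values():
            if node.state == "DECOMMISSIONING" and node.running_containers == 0:
                node.state = "DECOMMISSIONED"

    def pending(self, names):
        return [n for n in names if self.nodes[n].state == "DECOMMISSIONING"]


def cli_decommission(rm, names, timeout_ticks, interrupted_after=None):
    """Blocking CLI: the timeout lives here, on the client side only.

    interrupted_after simulates the user terminating the CLI early.
    """
    rm.graceful_decommission(names)
    for elapsed in range(timeout_ticks):
        if not rm.pending(names):
            return "clean"
        if interrupted_after is not None and elapsed >= interrupted_after:
            # CLI is gone; nodes stay DECOMMISSIONING, nothing is forced.
            return "interrupted"
        rm.tick()
    leftover = rm.pending(names)
    if not leftover:
        return "clean"
    # Timeout reached while the CLI is still alive: force the rest.
    rm.force_decommission(leftover)
    return "forced"
```

In this sketch, adjusting the timeout corresponds to terminating one `cli_decommission` call and starting another with a different `timeout_ticks`; because the first call never reaches its forcing step once interrupted, the old timeout cannot trigger a forceful decommission.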