We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?
Karthik Kambatla, we could. Let's file a separate JIRA?
We should just avoid opening or processing the client port until we've registered with the RM, if it's really a problem in practice.
Jason Lowe, this is not possible because the NM needs to report the RPC server port during registration, so the server has to start before registration.
2. For NM restart with no recovery support, startContainer will fail anyway because the NMToken is not valid.
3. For work-preserving RM restart, containers launched before the NM re-registers can be recovered on the RM when the NM sends the container statuses across. A startContainer call after re-registration will fail because the NMToken is not valid.
Jian He, these two errors will be much harder for apps to process and react to than the current named exception.
Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors. Overall, I am in favor of keeping the named exception with clients explicitly retrying.
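To make the "clients explicitly retrying" part concrete, here is a rough sketch of what a caller could do with a named exception. It is only an illustration: the NMNotYetReadyException class below is a local stand-in defined for this example rather than the real YARN class, and retryWhileNotReady, maxAttempts, and sleepMillis are hypothetical names, not existing client APIs.

```java
import java.util.concurrent.Callable;

// Sketch only: every name in this file is an assumption made for illustration.
class StartContainerRetrySketch {

    // Local stand-in for the named exception discussed in this thread.
    static class NMNotYetReadyException extends Exception {
        NMNotYetReadyException(String message) {
            super(message);
        }
    }

    // Runs the call, retrying only when the NM reports it is not ready yet.
    // Any other exception (for example an invalid NMToken) propagates at once,
    // so callers can still tell the two failure modes apart.
    static <T> T retryWhileNotReady(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NMNotYetReadyException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (NMNotYetReadyException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Fixed pause between attempts; a real client might back off.
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // All attempts hit the "not ready" case; surface the last one.
        throw last;
    }
}
```

With a named exception the retry condition is a single catch clause. With a generic failure such as an invalid-token error, the caller would have no reliable way to decide whether retrying makes sense.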