Do you have a writeup of the workflow for work-preserving NM restart?
Not yet. I'll try to cover the high points below in the interim.
According to the current subtasks, I can see that we need an NMStateStore.
Rather than explicitly calling that out as a subtask, I expected the state store to grow organically as persistence and restore support is added for each piece of context. A single subtask for the entire state store didn't make as much sense, since the people working on persisting and restoring the separate context pieces will know best what they require of the state store. Otherwise it seems like the state store subtask would be 80% of the work.
Beyond this, how does NM contact RM and AM about its reserved work?
We don't currently see any need for the NM to contact the AM about anything related to restart. The whole point of the restart is to be as transparent as possible to the rest of the system, as if the restart never occurred. As for the RM, we do need a small change: it should no longer assume all containers have been killed when an NM registers redundantly (i.e., the RM has not yet expired the node, but the node is registering again). That should be covered in the container state recovery subtask, or we can create an explicit separate subtask for it.
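To make the RM-side change concrete, here is a minimal sketch of the intended behavior. The class `NodeRegistry` and its methods are hypothetical names invented for illustration, not actual ResourceManager code; the sketch only assumes that a node the RM has not yet expired keeps its containers when it registers again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of RM-side bookkeeping; not the real ResourceManager classes.
class NodeRegistry {
    // Containers the RM believes are running, keyed by node id.
    private final Map<String, Set<String>> liveContainers = new HashMap<>();

    /** Records a container launch on the given node. */
    void containerStarted(String nodeId, String containerId) {
        liveContainers.computeIfAbsent(nodeId, k -> new HashSet<>()).add(containerId);
    }

    /** Called when the RM expires a node: its containers are considered lost. */
    void expire(String nodeId) {
        liveContainers.remove(nodeId);
    }

    /**
     * Handles a registration. A node that is still tracked (not yet expired)
     * keeps its containers instead of having them assumed killed; an unknown
     * or expired node starts with an empty set.
     * Returns a copy of the containers the RM considers valid on that node.
     */
    Set<String> register(String nodeId) {
        return new HashSet<>(liveContainers.computeIfAbsent(nodeId, k -> new HashSet<>()));
    }
}
```

The key design point is that registration alone no longer implies container loss; only expiry does.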
How do we distinguish NM restart and shutdown?
That is a good question and something we need to determine. I'll file a separate subtask to track it. The current thinking is that if recovery is enabled for the NM, it will persist its state as it changes and support restarting and rejoining the cluster without losing containers, provided it can do so before the RM notices (i.e., expires the NM). We could then use this feature not only for rolling upgrades but also for running the NM under a supervisor that restarts it after a crash, without catastrophic loss to the workload running on that node at the time. This requires decommission (i.e., a shutdown where the NM is not coming back anytime soon) to be a separate, explicit command sent to the NM, either via a special signal or an admin RPC call, so the NM knows to clean up containers and directories.
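The distinction described above could be sketched roughly as follows. `ShutdownKind` and `ShutdownPolicy` are hypothetical names used only for illustration, and the rule that an NM without recovery enabled always cleans up is an assumption of this sketch rather than a settled design.

```java
// Hypothetical sketch; these types do not exist in the NM code base.
enum ShutdownKind {
    RESTART,      // NM is expected to come back shortly (upgrade, supervisor restart)
    DECOMMISSION  // explicit command: NM is not coming back anytime soon
}

class ShutdownPolicy {
    private final boolean recoveryEnabled;

    ShutdownPolicy(boolean recoveryEnabled) {
        this.recoveryEnabled = recoveryEnabled;
    }

    /** Returns true if the NM should kill containers and delete local directories on the way down. */
    boolean shouldCleanUp(ShutdownKind kind) {
        if (!recoveryEnabled) {
            // Assumption: with no persisted state, containers cannot be reacquired later.
            return true;
        }
        // With recovery enabled, only an explicit decommission triggers cleanup.
        return kind == ShutdownKind.DECOMMISSION;
    }
}
```

For example, `new ShutdownPolicy(true).shouldCleanUp(ShutdownKind.RESTART)` returns `false`, so containers keep running while the NM is down.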
At a high level, the NM performs these tasks during restart:
- Recover last known state (containers, distcache contents, active tokens, pending deletions, etc.)
- Reacquire active containers and note containers that have exited in the interim since the last known state.
- Re-register with the RM. If the RM has not expired the NM, then the containers are still valid and we can proceed to update the RM with any containers that have exited.
- Recover log-aggregations in progress (which may involve re-uploading logs that were in-progress)
- Resume pending deletions
- Recover container localizations in progress (either reconnect or abort-and-retry)
We don't anticipate any changes to AMs, etc. The NM will be temporarily unavailable while it restarts, but the unavailability should be on the order of seconds.
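The restart steps listed above can be sketched as an ordered sequence. This is only an illustration: `NMRestartSketch` and the step names are hypothetical, and skipping the exited-container report when the RM has already expired the node is an assumption of this sketch, since that case is not spelled out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each step is represented by a name rather than real recovery logic.
class NMRestartSketch {
    /** Returns the ordered steps a restarting NM would perform. */
    static List<String> restartSteps(boolean rmExpiredNode) {
        List<String> steps = new ArrayList<>();
        steps.add("recover-state");           // containers, distcache, tokens, pending deletions
        steps.add("reacquire-containers");    // also notes containers that exited meanwhile
        steps.add("register-with-rm");
        if (!rmExpiredNode) {
            // Containers are still valid, so tell the RM which ones have exited.
            steps.add("report-exited-containers");
        }
        steps.add("recover-log-aggregation"); // may re-upload logs that were in progress
        steps.add("resume-deletions");
        steps.add("recover-localizations");   // reconnect or abort-and-retry
        return steps;
    }
}
```

Calling `NMRestartSketch.restartSteps(false)` yields all seven steps in the order given in the list.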