Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

      Description

      This serves as an umbrella ticket for tasks related to work-preserving nodemanager restart.

      Attachments

      1. YARN-1336-rollup-v2.patch (108 kB) - Jason Lowe
      2. YARN-1336-rollup.patch (237 kB) - Jason Lowe
      3. NMRestartDesignOverview.pdf (146 kB) - Jason Lowe


          Activity

          Jason Lowe added a comment -

          Upgrading all of the nodes in a cluster for a rolling upgrade can be a very disruptive or lengthy process. If the nodemanager is taken down, then all active containers on that node are killed. This is disruptive to jobs with long-running tasks, especially if one of those tasks ends up hitting this situation across multiple attempts. An alternative would be a drain-decommission of nodes as proposed in YARN-914. However, with long-running applications/tasks it can take a very long time to decommission a node, as we have to wait not only for the active containers to complete but also for active applications in general (e.g.: the node still has to serve up map task data after a map task completes, so auxiliary services can have responsibilities beyond the active containers). Performing a rolling upgrade on a large cluster will take a very long time if we need to wait for a clean drain-decommission of each node.

          Therefore it would be nice if the nodemanager supported a mode where it could be restarted and recover state. This would include recovering active container state, tokens, localized resource cache state, etc. We could then bounce the nodemanager to an updated version without losing containers and with minimal impact to jobs running on the grid, and the time to perform a rolling upgrade of a large cluster would no longer be tied to the running time of applications currently active on the cluster.

          Cindy Li added a comment -

          I'm interested in working on this. I'm working on another related JIRA, YARN-671: https://issues.apache.org/jira/browse/YARN-671

          Joep Rottinghuis added a comment -

          We're seeing the same thing during our cluster upgrades now that HA is making rolling upgrades possible. This would be an extremely good feature to have.

          Zhijie Shen added a comment -

          It sounds like a nice feature. I thought about it a bit before: as we allow the RM to restart, why not the NM? Jason Lowe, do you have some writeup about the workflow of work-preserving NM restart? If you have, would you mind sharing it? I'm curious about the design. According to the current subtasks, I can see that we need an NMStateStore (like RMStateStore for the RM) to store the aforementioned information when the NM stops, and to recover all the state when the NM starts again. Beyond this, how does the NM contact the RM and AM about its preserved work?

          I have another question w.r.t. this feature. How do we distinguish NM restart from shutdown? If an NM shuts down and never comes back, should the work still be preserved (or trapped) there? Currently, an NM kills the containers on it immediately when it shuts down, and the application has the chance to start another container to do its work.

          Jason Lowe added a comment -

          do you have some writeup about the workflow of work-preserving NM restart?

          Not yet. I'll try to cover the high points below in the interim.

          According to the current subtasks, I can see that we need an NMStateStore

          Rather than explicitly call that out as a subtask, I was expecting it would be grown and extended organically as the persistence and restore for each piece of context was added. Having a subtask for the entire state store didn't make as much sense, since the people working on persisting/restoring the separate context pieces will know best what they require of the state store. Otherwise the state store subtask ends up being 80% of the work.

          Beyond this, how does the NM contact the RM and AM about its preserved work?

          We don't currently see any need for the NM to contact the AM about anything related to restart. The whole point of the restart is to be as transparent as possible to the rest of the system – as if the restart never occurred. As for the RM, we do need a small change where it no longer assumes all containers have been killed when an NM registers redundantly (i.e.: the RM has not yet expired the node, yet it is registering again). That should be covered in the container state recovery subtask, or we can create an explicit separate subtask for that.

          How do we distinguish NM restart from shutdown?

          That is a good question and something we need to determine. I'll file a separate subtask to track it. The current thinking is that if recovery is enabled for the NM, then it will persist its state as it changes and support restarting/rejoining the cluster without containers being lost, as long as it can do so before the RM notices (i.e.: expires the NM). Then we could use this feature not only for rolling upgrades but also for running the NM under a supervisor that can restart it if it crashes, without catastrophic loss to the workload running on that node at the time. This requires decommission (i.e.: a shutdown where the NM is not coming back anytime soon) to be a separate, explicit command sent to the NM (either via a special signal or an admin RPC call), so the NM knows to clean up its containers and directories.
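
          To make that concrete, here is a minimal sketch of the restart-vs-decommission split being proposed, assuming a hypothetical decommission flag that is set only by an explicit admin command or signal; all names below are illustrative and not taken from any patch:

          // Hypothetical sketch only: a plain restart preserves state, while an
          // explicit decommission cleans up fully. None of these class or method
          // names come from the actual NM code or the attached patches.
          public class NodeManagerShutdownSketch {
            private boolean recoveryEnabled;   // yarn.nodemanager.recovery.enabled
            private boolean decommissioned;    // set only by an explicit admin signal/RPC

            public void markDecommissioned() {
              decommissioned = true;
            }

            public void shutdown() {
              if (recoveryEnabled && !decommissioned) {
                // Plain restart: leave containers running and keep the recovery state
                // on disk so the next NM instance can reacquire them on startup.
                closeStateStore();
              } else {
                // Decommission: the node is not coming back soon, so clean up fully.
                killAllContainers();
                cleanupLocalDirsAndLogs();
                closeStateStore();
                deleteRecoveryState();
              }
            }

            private void closeStateStore() { /* stub for the sketch */ }
            private void killAllContainers() { /* stub for the sketch */ }
            private void cleanupLocalDirsAndLogs() { /* stub for the sketch */ }
            private void deleteRecoveryState() { /* stub for the sketch */ }
          }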

          At a high level, the NM performs these tasks during restart:

          • Recover last known state (containers, distcache contents, active tokens, pending deletions, etc.)
          • Reacquire active containers and note any containers that have exited since the last known state.
          • Re-register with the RM. If RM has not expired the NM, then the containers are still valid and we can proceed to update the RM with any containers that have exited
          • Recover log-aggregations in progress (which may involve re-uploading logs that were in-progress)
          • Resume pending deletions
          • Recover container localizations in progress (either reconnect or abort-and-retry)

          We don't anticipate any changes to AMs, etc. The NM will be temporarily unavailable while it restarts, but the unavailability should be on the order of seconds.
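
          For illustration, a rough sketch of that startup sequence, with the steps above shown as placeholder method calls (none of these names are from the actual implementation):

          // Rough sketch of the restart flow listed above; all names are placeholders.
          public class NmRestartFlowSketch {
            interface StateStore {
              RecoveredState load();   // containers, distcache contents, tokens, deletions
            }
            static class RecoveredState { }

            void startWithRecovery(StateStore store) {
              RecoveredState state = store.load();  // 1. recover last known state
              reacquireContainers(state);           // 2. reattach to live containers and
                                                    //    note ones that exited meanwhile
              registerWithRM(state);                // 3. re-register; if the RM hasn't expired
                                                    //    us, report any exited containers
              resumeLogAggregation(state);          // 4. may re-upload in-progress logs
              resumePendingDeletions(state);        // 5. finish scheduled deletions
              recoverLocalizations(state);          // 6. reconnect or abort-and-retry
            }

            void reacquireContainers(RecoveredState s) { }
            void registerWithRM(RecoveredState s) { }
            void resumeLogAggregation(RecoveredState s) { }
            void resumePendingDeletions(RecoveredState s) { }
            void recoverLocalizations(RecoveredState s) { }
          }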

          Cindy Li added a comment -

          In the case of a rolling upgrade, where e.g. some new configuration or a fix would be picked up when the nodemanager restarts, would that cause any issues during the state/work recovery process?

          Jason Lowe added a comment -

          It depends upon the nature of the config change or fix. In essence this is no different than the RM restart use-case today. Any config changes or fixes need to keep recovery on startup in mind. Most fixes won't be an issue, but anything that changes the syntax or semantics of the state store data or recovery process in general will have to deal with the state store format from a previous version to remain compatible.

          Ideally we'd like to be able to support work-preserving rolling upgrades as well as work-preserving rolling downgrades, so one can smoothly recover from a spoiled upgrade without taking down the whole cluster. If the persisted state format isn't changing then this should be straightforward. However if the state format does change between versions and we end up only supporting a one-way conversion from the old format to the new format then that would be a case where we support a work-preserving rolling upgrade but not a work-preserving rolling downgrade between those versions. A downgrade would still be possible with the loss of containers, of course, by simply removing the state store data and restarting.

          In summary, we would need to be cognizant of changes that affect state recovery upon startup so a work-preserving restart can be used to support work-preserving rolling upgrades. This applies to both the RM and the NM.
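
          To make the compatibility point concrete, here is a hedged sketch of the kind of version check a state store could perform on load; the major/minor convention and the names are assumptions for illustration, not something defined by this JIRA:

          // Illustrative only: one way a state store could guard its on-disk schema
          // across rolling upgrades and downgrades. The versioning scheme is assumed.
          public class StateStoreVersionSketch {
            static final int CURRENT_MAJOR = 1;
            static final int CURRENT_MINOR = 0;

            void checkCompatibility(int storedMajor, int storedMinor) {
              if (storedMajor == CURRENT_MAJOR) {
                // Minor-version differences are treated as compatible in both
                // directions, which is what allows a work-preserving downgrade.
                return;
              }
              // A major-version change implies a one-way conversion; only a
              // work-preserving upgrade (not a downgrade) would be possible across it.
              throw new IllegalStateException("Incompatible state store version "
                  + storedMajor + "." + storedMinor + ", expected major version "
                  + CURRENT_MAJOR);
            }
          }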

          Ming Ma added a comment -

          Jason, nice work, and thanks for driving this. A couple of comments:

          1. One of the scenarios for NM restart is NM config update. In that scenario, it might be worth calling out that having the NM support dynamic config reload could be one design option, though not necessarily something we should do.

          2. It seems your design is based on a quick NM restart, with no need to kill the existing containers during the restart. That keeps the design simple. There is one scenario where we want to decommission the node and would like to preserve the state of long-running tasks. For that, the RM and AM will somehow need to know about it so the tasks can be checkpointed and resumed on other nodes. A lot of work has been done in the preemption space for that. Is that something covered here?

          3. ShuffleHandler support. ShuffleHandler is a component above YARN. There might be scenarios where we just need to update the NM without updating the ShuffleHandler, or the other way around. I don't know your approach. Would making the ShuffleHandler out-of-proc help? During an NM restart the ShuffleHandler process just keeps running, and the NM creates a proxy object to reconnect to the ShuffleHandler process. If we end up having several AuxiliaryServices for different types of applications, an out-of-proc approach also makes them easier to manage from a resource-utilization standpoint and reduces the impact of one type of AuxiliaryService on another.

          Jason Lowe added a comment -

          One of the scenarios for NM restart is NM config update.

          If we want to update the NM configs without a restart, I think that's a separate effort. In theory that's not strictly necessary if the NM preserves work when it restarts, but there will be cases where an NM restart can cause 'hiccups' (i.e.: we're mid-shuffle and have to retry or need to retry a container launch request).

          There is one scenario where we want to decommission the node and would like to preserve the state of long-running tasks. For that, the RM and AM will somehow need to know about it so the tasks can be checkpointed and resumed on other nodes.

          Task checkpointing or task migration is not in the scope of this work.

          Would making the ShuffleHandler out-of-proc help?

          I'm not planning on moving the aux services out of the NM as part of this effort. As you point out, it's something that would be good to do to separate concerns even without the NM restart feature. I agree the experience for aux service clients will be much smoother if they're moved out and we're only restarting the NM, but I don't see moving them out as a prerequisite to supporting basic NM restart functionality. As such, I think moving the aux services outside the NM should be a separate JIRA that could theoretically be completed before or after this feature.

          Ming Ma added a comment -

          Thanks Jason for the clarification. I have commented on https://issues.apache.org/jira/i#browse/YARN-914 for the NM decomm scenario and opened https://issues.apache.org/jira/i#browse/YARN-1593 for out-of-proc AuxiliaryServices work.

          Jason Lowe added a comment -

          Attaching a rollup patch for the prototype that Ravi Prakash and I developed. This recovers resource localization state, applications and containers, tokens, log aggregation, deletion service, and the MR shuffle auxiliary service. A quick high-level overview:

          • Restart functionality is enabled by configuring yarn.nodemanager.recovery.enabled to true and yarn.nodemanager.recovery.dir to a directory on the local filesystem where the state will be stored (see the example after this list).
          • Containers are launched with an additional shell layer which places the exit code of the container in an .exitcode file. This allows the restarted NM instance to recover containers that are already running or have exited since the last NM instance.
          • NMStateStoreService is the abstraction layer for the state store. NMNullStateStoreService is used when recovery is disabled and NMLevelDBStateStoreService is used when it is enabled.
          • Rather than explicitly record localized resource reference counts, resources are recovered with no references and recovered containers re-request their resources as during a normal container lifecycle to restore the reference counts.
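
          As an example of the first bullet above, the two recovery properties can be set programmatically when experimenting (e.g., with an embedded or pseudo-distributed setup); in a real deployment they would normally go in yarn-site.xml on each node. The recovery directory path below is just an illustrative choice:

          import org.apache.hadoop.conf.Configuration;

          // Example of enabling NM recovery. In a real deployment the same keys
          // would normally be set in yarn-site.xml; the directory path here is an
          // example, not a required location.
          public class EnableNmRecoveryExample {
            public static Configuration enableRecovery(Configuration conf) {
              conf.setBoolean("yarn.nodemanager.recovery.enabled", true);
              conf.set("yarn.nodemanager.recovery.dir", "/var/hadoop/yarn/nm-recovery");
              return conf;
            }

            public static void main(String[] args) {
              Configuration conf = enableRecovery(new Configuration());
              System.out.println("NM recovery enabled: "
                  + conf.getBoolean("yarn.nodemanager.recovery.enabled", false));
            }
          }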

          Some things that are still missing:

          • ability to distinguish shutdown for restart vs. decommission
          • proper handling of state store errors
          • adding unit tests
          • adding formal documentation

          Feedback is greatly appreciated. I'll be working on addressing the missing items and splitting the patch into smaller pieces across the appropriate subtasks to simplify reviews.

          Karthik Kambatla (Inactive) added a comment -

          Thanks for the update, Jason.

          I just tried it on a pseudo-dist cluster: ongoing containers continue to make progress across an NM restart. It looks very neat! I also briefly skimmed the rollup patch, and things look promising.

          Jason Lowe added a comment -

          Attaching a PDF that briefly describes the approach and how the methods of the state store interface are used to persist and recover state.

          Vinod Kumar Vavilapalli added a comment -

          Tx for the doc, Jason!

          Jason Lowe added a comment -

          Refreshing the rollup patch to latest trunk so it's easier for people to play with the feature and get a general sense of things before the rest of the patches are integrated. Notable fixes since the last rollup patch include fixing container reacquisition and avoiding deleting log directories on NM teardown when we're restarting.

          Junping Du added a comment -

          Good work, Jason Lowe! Thanks for sharing.

          Junping Du added a comment -

          Jason Lowe, YARN-1354 just got into trunk. Do we plan to sync the patch here and put it somewhere (I think it should be YARN-1337) for review?

          Jason Lowe added a comment -

          I've updated YARN-1337 with a patch to recover containers. I believe that's the last uncommitted JIRA that was covered by the rollup patches posted here, so we may be past the point of needing rollup patches on this umbrella issue.

          Junping Du added a comment -

          Got it. Will help to review YARN-1337. Thanks Jason Lowe!


            People

            • Assignee: Jason Lowe
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 42
