This document gives an overview of NodeManager (NM) restart, a feature that enables the NodeManager to be restarted without losing the active containers running on the node. At a high level, the NM stores any necessary state to a local state-store as it processes container-management requests. When the NM restarts, it recovers by first loading state for various subsystems and then letting those subsystems perform recovery using the loaded state.
| Property | Value |
|---|---|
| yarn.nodemanager.recovery.enabled | true, (default value is set to false) |
| Property | Description |
|---|---|
| yarn.nodemanager.recovery.dir | The local filesystem directory in which the node manager will store state when recovery is enabled. The default value is set to $hadoop.tmp.dir/yarn-nm-recovery. |
| Property | Description |
|---|---|
| yarn.nodemanager.address | Ephemeral ports (port 0, which is default) cannot be used for the NodeManager's RPC server specified via yarn.nodemanager.address as it can make NM use different ports befor and after a restart. This will break any previously running clients that were communicating with the NM before restart. Explicitly setting yarn.nodemanager.address to an address with specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling NM restart>>>. |
NodeManager can be configured to run auxiliary services. A simple example is the ShuffleService for MapReduce. For enabling NM restart, this and any other auxiliary service also needs to be configured correctly. This includes (1) avoiding usage of ephemeral ports (e.g. mapreduce.shuffle.port) and (2) having the auxiliary-service recoverable from previous state when NodeManager restarts and reinitialized the auxiliary service.