NodeManager Restart

Introduction

This document gives an overview of NodeManager (NM) restart, a feature that enables the NodeManager to be restarted without losing the active containers running on the node. At a high level, the NM stores any necessary state to a local state-store as it processes container-management requests. When the NM restarts, it recovers by first loading state for various subsystems and then letting those subsystems perform recovery using the loaded state.

Enabling NM Restart

  1. To enable NM restart, set the following property in conf/yarn-site.xml to true:
    Property                            Value
    yarn.nodemanager.recovery.enabled   true (the default value is false)
  2. Configure a path to the local filesystem directory where the NodeManager can save its run state:
    Property                            Description
    yarn.nodemanager.recovery.dir       The local filesystem directory in which the NodeManager stores state when recovery is enabled. The default value is ${hadoop.tmp.dir}/yarn-nm-recovery.
  3. Configure a valid RPC address for the NodeManager:
    Property                            Description
    yarn.nodemanager.address            Ephemeral ports (port 0, which is the default) cannot be used for the NodeManager's RPC server, specified via yarn.nodemanager.address, because the NM could then bind to a different port before and after a restart. That would break any clients that were communicating with the NM before the restart. Explicitly setting yarn.nodemanager.address to an address with a specific port number (e.g. 0.0.0.0:45454) is a precondition for enabling NM restart.
  4. Auxiliary services

    The NodeManager can be configured to run auxiliary services. A simple example is the ShuffleService for MapReduce. For NM restart to work, the ShuffleService and any other auxiliary services must also be configured to support it. This includes (1) avoiding ephemeral ports (for the ShuffleService, the port is set via mapreduce.shuffle.port) and (2) making each auxiliary service able to recover its previous state when the NodeManager restarts and reinitializes it.
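
Steps 1 to 3 above could be combined into a conf/yarn-site.xml fragment along the following lines. This is a minimal sketch rather than a complete configuration: the recovery directory /var/lib/hadoop-yarn/nm-recovery is a hypothetical path chosen for illustration, and 0.0.0.0:45454 is simply the example address from step 3. Only the three property names come from the text above.

```xml
<configuration>
  <!-- Step 1: turn on NM recovery (the default is false). -->
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>

  <!-- Step 2: where the NM stores its state.
       Hypothetical path; pick a local directory that survives NM restarts. -->
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/lib/hadoop-yarn/nm-recovery</value>
  </property>

  <!-- Step 3: fixed RPC port instead of the default ephemeral port 0. -->
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>
```

Auxiliary services (step 4) are configured separately. Each one needs its own fixed port and its own recovery support, so the fragment above is not sufficient by itself when such services are in use.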