diff --git hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
index 30a3a64..16a7ca5 100644
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
@@ -32,23 +32,26 @@ ResourceManger Restart
   ResourceManager Restart feature is divided into two phases:
 
-  ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state
+  ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
+  Enhance RM to persist application/attempt state
   and other credentials information in a pluggable state-store. RM will reload
   this information from state-store upon restart and re-kick the previously
   running applications. Users are not required to re-submit the applications.
 
-  ResourceManager Restart Phase 2:
-  Focus on re-constructing the running state of ResourceManger by reading back
+  ResourceManager Restart Phase 2 (Work-preserving RM restart):
+  Focus on reconstructing the running state of ResourceManager by combining
   the container statuses from NodeMangers and container requests from ApplicationMasters
   upon restart. The key difference from phase 1 is that previously running applications
   will not be killed after RM restarts, and so applications won't lose its work
   because of RM outage.
 
+* {Feature}
+
+** Phase 1: Non-work-preserving RM restart
+
   As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
   is described below.
 
-* {Feature}
-
   The overall concept is that RM will persist the application metadata
   (i.e. ApplicationSubmissionContext) in a pluggable state-store when client
   submits an application and also saves the final status
@@ -65,10 +68,10 @@ ResourceManger Restart
   NodeMangers and clients during the down-time of RM will keep polling RM until
   RM comes up. When RM becomes alive, it will send a re-sync command to
   all the NodeMangers and ApplicationMasters it was talking to via heartbeats.
-  Today, the behaviors for NodeMangers and ApplicationMasters to handle this command
+  As of the Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
   are: NMs will kill all its managed containers and re-register with RM. From the
   RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs.
-  AMs(e.g. MapReduce AM) today are expected to shutdown when they receive the re-sync command.
+  AMs (e.g. the MapReduce AM) are expected to shut down when they receive the re-sync command.
   After RM restarts and loads all the application metadata, credentials from state-store
   and populates them into memory, it will create a new
   attempt (i.e. ApplicationMaster) for each application that was not yet completed
@@ -76,13 +79,33 @@ ResourceManger Restart
   applications' work is lost in this manner since they are essentially killed by
   RM via the re-sync command on restart.
 
-* {Configurations}
-
-  This section describes the configurations involved to enable RM Restart feature.
+** Phase 2: Work-preserving RM restart
+
+  As of Hadoop 2.6.0, we further enhanced the RM restart feature so that
+  applications running on the YARN cluster are not killed when RM restarts.
+
+  Beyond all the groundwork that has been done in Phase 1 to ensure the persistence
+  of application state and to reload that state on recovery, Phase 2 primarily focuses
+  on reconstructing the entire running state of the YARN cluster, the majority of which is
+  the state of the central scheduler inside RM, which keeps track of all containers'
+  life-cycles, applications' headroom and resource requests, queues' resource usage, etc.
+  In this way, RM does not need to kill the AM and re-run the application from scratch
+  as is done in Phase 1. Applications can simply re-sync back with RM and
+  resume from where they left off.
+
+  RM recovers its running state by taking advantage of the container statuses sent from all NMs.
+  An NM will not kill its containers when it re-syncs with the restarted RM. It continues
+  managing the containers and sends the container statuses across to RM when it re-registers.
+  RM reconstructs the container instances and the associated applications' scheduling status by
+  absorbing this container information. In the meantime, the AM needs to re-send its
+  outstanding resource requests to RM, because RM may lose the unfulfilled requests when it shuts down.
+  Application writers using the AMRMClient library to communicate with RM do not need to
+  worry about AM re-sending resource requests to RM on re-sync, as it is
+  automatically taken care of by the library itself.
 
-  * Enable ResourceManager Restart functionality.
+* {Configurations}
 
-    To enable RM Restart functionality, set the following property in <> to true:
+** Enable RM Restart.
 
 *--------------------------------------+--------------------------------------+
 || Property || Value |
 *--------------------------------------+--------------------------------------+
@@ -92,9 +115,10 @@ ResourceManger Restart
 *--------------------------------------+--------------------------------------+
 
-  * Configure the state-store that is used to persist the RM state.
+** Configure the state-store for persisting the RM state.
 
-*--------------------------------------+--------------------------------------+
+
+*--------------------------------------*--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -108,9 +132,9 @@ ResourceManger Restart
 | | <<>>. |
 *--------------------------------------+--------------------------------------+
 
-  * Configurations when using Hadoop FileSystem based state-store implementation.
+** Configurations for Hadoop FileSystem based state-store implementation.
 
-  Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
+  Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -123,8 +147,8 @@ ResourceManger Restart
 | | <> will be used. |
 *--------------------------------------+--------------------------------------+
 
-  Configure the retry policy state-store client uses to connect with the Hadoop
-  FileSystem.
+  Configure the retry policy state-store client uses to connect with the Hadoop
+  FileSystem.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -137,9 +161,9 @@ ResourceManger Restart
 | | Default value is (2000, 500) |
 *--------------------------------------+--------------------------------------+
 
-  * Configurations when using ZooKeeper based state-store implementation.
+** Configurations for ZooKeeper based state-store implementation.
-  Configure the ZooKeeper server address and the root path where the RM state is stored.
+  Configure the ZooKeeper server address and the root path where the RM state is stored.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -154,7 +178,7 @@ ResourceManger Restart
 | | Default value is /rmstore. |
 *--------------------------------------+--------------------------------------+
 
-  Configure the retry policy state-store client uses to connect with the ZooKeeper server.
+  Configure the retry policy state-store client uses to connect with the ZooKeeper server.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -175,7 +199,7 @@ ResourceManger Restart
 | | value is 10 seconds |
 *--------------------------------------+--------------------------------------+
 
-  Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
+  Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -184,7 +208,7 @@ ResourceManger Restart
 | | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<>> |
 *--------------------------------------+--------------------------------------+
 
-  * Configure the max number of application attempt retries.
+** Configure the max number of application attempt retries (only required for non-work-preserving RM restart).
 
 *--------------------------------------+--------------------------------------+
 || Property || Description |
 *--------------------------------------+--------------------------------------+
 | <<>> | |
@@ -198,7 +222,9 @@ ResourceManger Restart
 | | allow at least one retry for AM. |
 *--------------------------------------+--------------------------------------+
 
-  This configuration's impact is in fact beyond RM restart scope. It controls
+  This configuration is only required for non-work-preserving RM restart;
+  it is not needed for work-preserving restart. This configuration's impact is in
+  fact beyond the RM restart scope. It controls
   the max number of attempts an application can have. In RM Restart Phase 1, this
   configuration is needed since as described earlier each time RM restarts, it kills
   the previously running attempt (i.e. ApplicationMaster) and
@@ -206,3 +232,75 @@ ResourceManger Restart
   attempt count to increase by 1. In RM Restart phase 2, this configuration is not
   needed since the previously running ApplicationMaster will not be killed and
   the AM will just re-sync back with RM after RM restarts.
+
+** Configurations for work-preserving RM recovery.
+
+*--------------------------------------+--------------------------------------+
+|| Property || Description |
+*--------------------------------------+--------------------------------------+
+| <<>> | |
+| | Enable RM work-preserving recovery. This configuration is private |
+| | to YARN for experimenting with the feature. As of Hadoop 2.6.0, this |
+| | configuration is false by default. It will be true by default in future |
+| | releases. This configuration needs to be set on both the RM and all NMs. |
+*--------------------------------------+--------------------------------------+
+| <<>> | |
+| | Set the amount of time RM waits before allocating new |
+| | containers on RM work-preserving recovery. Such a wait period gives RM a chance |
+| | to settle down re-syncing with NMs in the cluster on recovery, before assigning |
+| | new containers to applications. |
+*--------------------------------------+--------------------------------------+
+
+* {Notes}
+
+  The ContainerId string format is changed if RM restarts with work-preserving recovery enabled.
+  It used to be in the format:
+
+  Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
+
+  It is now changed to:
+
+  Container_<>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<>_1410901177871_0001_01_000005.
+
+  Here, the additional epoch number is a
+  monotonically increasing integer which starts from 0 and is increased by 1 each time
+  RM restarts. If the epoch number is 0, it is omitted and the containerId string format
+  stays the same as before.
+
+* {Sample configurations}
+
+  Below is a minimum set of configurations for enabling RM work-preserving restart:
+
++---+
+   <property>
+     <description>Enable RM to recover state after starting. If true, then
+     yarn.resourcemanager.store.class must be specified</description>
+     <name>yarn.resourcemanager.recovery.enabled</name>
+     <value>true</value>
+   </property>
+
+   <property>
+     <description>Enable RM work-preserving recovery. This configuration is private
+     to YARN for experimenting with the feature. As of Hadoop 2.6.0, this configuration
+     is false by default. It will be true by default in future releases. This
+     configuration needs to be set on both the RM and all NMs.</description>
+     <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
+     <value>true</value>
+   </property>
+
+   <property>
+     <description>The class to use as the persistent store.</description>
+     <name>yarn.resourcemanager.store.class</name>
+     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
+   </property>
+
+   <property>
+     <description>Host:Port of the ZooKeeper server where RM state will
+     be stored. This must be supplied when using
+     org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
+     as the value for yarn.resourcemanager.store.class</description>
+     <name>yarn.resourcemanager.zk-address</name>
+     <value>127.0.0.1:2181</value>
+   </property>
++---+
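+
+  The sample above uses the ZooKeeper based state-store. A similar minimal sketch for
+  the Hadoop FileSystem based state-store is shown below. The FileSystemRMStateStore
+  class name and the yarn.resourcemanager.fs.state-store.uri property used here are
+  assumed from the Hadoop defaults rather than spelled out in the tables above, so
+  verify them against yarn-default.xml for the release in use; the
+  hdfs://localhost:9000/rmstore URI is only an example value and should be replaced
+  with the actual NameNode address and path.
+
++---+
+   <!-- Sketch only: the store class and URI property below are assumed from the
+        Hadoop defaults; the URI value is an example and must point at a real
+        NameNode address and path. -->
+   <property>
+     <description>Enable RM to recover state after starting.</description>
+     <name>yarn.resourcemanager.recovery.enabled</name>
+     <value>true</value>
+   </property>
+
+   <property>
+     <description>The class to use as the persistent store.</description>
+     <name>yarn.resourcemanager.store.class</name>
+     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
+   </property>
+
+   <property>
+     <description>URI pointing to the FileSystem path where the RM state
+     will be stored.</description>
+     <name>yarn.resourcemanager.fs.state-store.uri</name>
+     <value>hdfs://localhost:9000/rmstore</value>
+   </property>
++---+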