[YARN-556] [Umbrella] RM Restart phase 2 - Work preserving restart

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: graceful, resourcemanager, rolling upgrade
Labels: None
Target Version/s: 2.6.0

Description

YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella JIRA tracks the changes needed to recover the running state of the cluster so that work can be preserved across RM restarts.

Attachments

WorkPreservingRestartPrototype.001.patch	19/Apr/14 01:09	88 kB	Anubhav Dhoot
Work Preserving RM Restart.pdf	23/Aug/13 01:50	202 kB	Bikas Saha
YARN-1372.prelim.patch	01/Aug/14 18:08	53 kB	Anubhav Dhoot

Issue Links

is depended upon by

YARN-666 [Umbrella] Support rolling upgrades in YARN	Closed

is related to

YARN-128 [Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery	Resolved

relates to

YARN-6128 Add support for AMRMProxy HA	Resolved
MAPREDUCE-5567 [Umbrella] Stabilize MR framework w.r.t ResourceManager restart	Resolved
YARN-149 [Umbrella] ResourceManager (RM) Fail-over	Resolved

Sub-Tasks

1.	ApplicationMasterService to allow Register of an app that was running before restart	Closed	Anubhav Dhoot
2.	AM should implement Resync with the ApplicationMasterService instead of shutting down	Closed	Rohith Sharma K S
3.	After restart NM should resync with the RM without killing containers	Closed	Anubhav Dhoot
4.	Common work to re-populate containers’ state into scheduler	Closed	Jian He
5.	Capacity scheduler to re-populate container allocation state	Resolved	Jian He
6.	Fair scheduler to re-populate container allocation state	Closed	Anubhav Dhoot
7.	FIFO scheduler to re-populate container allocation state	Resolved	Jian He
8.	Ensure all completed containers are reported to the AMs across RM restart	Closed	Anubhav Dhoot
9.	Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps	Resolved	Omkar Vinit Joshi
10.	Revisit AM link being broken for work preserving restart	Resolved	Unassigned
11.	Recover Unmanaged AMs	Resolved	Anubhav Dhoot
12.	Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over	Closed	Tsuyoshi Ozawa
13.	Fix ordering of starting services inside the RM	Resolved	Jian He
14.	Threshold for RM to accept requests from AM after failover	Closed	Jian He
15.	Merge some of the common lib code in schedulers	Closed	Jian He
16.	ContainerId creation after work preserving restart is broken	Closed	Tsuyoshi Ozawa
17.	Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus	Closed	Jian He
18.	Recover missing container information	Closed	Jian He
19.	Ensure distributed shell work with RM work-preserving recovery	Closed	Jian He
20.	Update ContainerId#toString() to avoid conflicts before and after RM restart	Closed	Tsuyoshi Ozawa
21.	ContainerId can overflow with RM restart	Closed	Tsuyoshi Ozawa
22.	AM release request may be lost on RM restart	Closed	Jian He
23.	Add containers to launchedContainers list in RMNode on container recovery	Closed	Jian He
24.	Marking ContainerId#getId as deprecated	Closed	Tsuyoshi Ozawa
25.	RM should not recover containers from previously failed attempt when AM restart is not enabled	Closed	Jian He
26.	Possible livelock in CapacityScheduler when RM is recovering apps	Closed	Jian He
27.	Update ConverterUtils#toContainerId to parse epoch	Closed	Tsuyoshi Ozawa
28.	Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId	Closed	Tsuyoshi Ozawa
29.	Add a percentage-node threshold for RM to wait for new allocations after restart/failover	Open	Vinod Kumar Vavilapalli
30.	Distributed shell AM may re-launch containers if RM work preserving restart happens	Resolved	Shane Kumpf
31.	TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks	Closed	Tsuyoshi Ozawa
32.	NPE when RM tries to transfer state from previous attempt on recovery	Resolved	Jian He
33.	Document work-preserving RM restart	Closed	Jian He
34.	Make work-preserving-recovery the default mechanism for RM recovery	Closed	Jian He

People

Assignee: Unassigned

Reporter: Bikas Saha

Votes: 0

Watchers: 50

Dates

Created: 08/Apr/13 20:49

Updated: 27/Jan/17 20:00

Resolved: 03/May/15 01:03