[YARN-128] [Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0-alpha
Fix Version/s: None
Component/s: resourcemanager
Labels:
None

Description

This umbrella jira tracks the work needed to preserve critical state information and reload them upon RM restart.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

YARN-128.patch
18/Oct/12 06:54
92 kB
Devaraj Kavali
YARN-128.old-code-removed.4.patch
13/Nov/12 11:59
123 kB
Bikas Saha
YARN-128.old-code-removed.3.patch
12/Nov/12 05:29
123 kB
Bikas Saha
YARN-128.new-code-added-4.patch
13/Nov/12 11:59
78 kB
Bikas Saha
YARN-128.new-code-added.3.patch
12/Nov/12 05:29
74 kB
Bikas Saha
YARN-128.full-code-4.patch
13/Nov/12 12:00
179 kB
Bikas Saha
YARN-128.full-code.5.patch
20/Nov/12 13:09
255 kB
Bikas Saha
YARN-128.full-code.3.patch
12/Nov/12 05:30
176 kB
Bikas Saha
RMRestartPhase1.pdf
09/Nov/12 18:32
59 kB
Bikas Saha
RM-recovery-initial-thoughts.txt
24/Sep/12 05:06
3 kB
Bikas Saha
restart-zk-store-11-17.patch
17/Nov/12 17:39
61 kB
Bikas Saha
restart-fs-store-11-17.patch
17/Nov/12 17:39
17 kB
Bikas Saha
restart-12-11-zkstore.patch
16/Nov/12 15:56
21 kB
Bikas Saha
MR-4343.1.patch
18/Jun/12 18:09
17 kB
Tsuyoshi Ozawa

Issue Links

depends upon

MAPREDUCE-5086 MR app master deletes staging dir when sent a reboot command from the RM

Closed

is blocked by

MAPREDUCE-5505 Clients should be notified job finished only after job successfully unregistered

Closed

YARN-1082 Secure RM with recovery enabled and rm state store on hdfs fails with gss exception

Closed

YARN-495 Change NM behavior of reboot to resync

Closed

is related to

MAPREDUCE-5476 Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM

Closed

relates to

MAPREDUCE-5471 Succeed job tries to restart after RMrestart

Resolved

MAPREDUCE-5127 MR job succeeds and exits even when unregister with RM fails

Resolved

MAPREDUCE-5466 Historyserver does not refresh the result of restarted jobs after RM restart

Closed

YARN-1305 RMHAProtocolService#serviceInit should handle HAUtil's IllegalArgumentException

Closed

MAPREDUCE-5472 reducer of sort job restarts from scratch in between after RM restart

Resolved

MAPREDUCE-5567 [Umbrella] Stabilize MR framework w.r.t ResourceManager restart

Resolved

YARN-209 Capacity scheduler doesn't trigger app-activation after adding nodes

Closed

YARN-479 NM retry behavior for connection to RM should be similar for lost heartbeats

Closed

YARN-149 [Umbrella] ResourceManager (RM) Fail-over

Resolved

YARN-218 Distiguish between "failed" and "killed" app attempts

Resolved

YARN-556 [Umbrella] RM Restart phase 2 - Work preserving restart

Resolved

YARN-1139 [Umbrella] Convert all RM components to Services

Open

(12 relates to)

Sub-Tasks

1.	Remove old code for restart	Closed	Bikas Saha
2.	Make changes for RM restart phase 1	Closed	Bikas Saha
3.	Add FS-based persistent store implementation for RMStateStore	Closed	Bikas Saha
4.	Add FileSystem based store for RM	Resolved	Bikas Saha
5.	Security related work for RM restart	Resolved	Bikas Saha
6.	Add Zookeeper-based store implementation for RMStateStore	Closed	Karthik Kambatla
7.	Add HDFS based store for RM which manages the store using directories	Resolved	Jian He
8.	Create common proxy client for communicating with RM	Closed	Jian He
9.	Delayed store operations should not result in RM unavailability for app submission	Closed	Zhijie Shen
10.	AM max attempts is not checked when RM restart and try to recover attempts	Closed	Jian He
11.	Race condition causing RM to potentially relaunch already unregistered AMs on RM restart	Closed	Jian He
12.	NM should reject containers allocated by previous RM	Closed	Jian He
13.	Test and verify that app delegation tokens are added to tokenRenewer after RM restart	Closed	Jian He
14.	Restore appToken and clientToken for app attempt after RM restart	Closed	Jian He
15.	Restore clientToken for app attempt after RM restart	Resolved	Jian He
16.	Restore RMDelegationTokens after RM Restart	Closed	Jian He
17.	RMStateStore's removeApplication APIs should just take an applicationId	Resolved	Tsuyoshi Ozawa
18.	Slow or failing DelegationToken renewals on submission itself make RM unavailable	Closed	Omkar Vinit Joshi
19.	verify that new jobs submitted with old RM delegation tokens after RM restart are accepted	Closed	Jian He
20.	Store completed application information in RM state store	Closed	Jian He
21.	RM crashes if it restarts while the state-store is down	Closed	Jian He
22.	Apps Completed metrics on web UI is not correct after RM restart	Resolved	Jian He
23.	List of applications at NM web UI is inconsistent with applications at RM UI after RM restart	Resolved	Jian He
24.	Change FileSystemRMStateStore to use directories	Closed	Jian He
25.	Document RM Restart feature	Closed	Jian He
26.	"Active users" field in Resourcemanager scheduler UI gives negative values	Resolved	Unassigned
27.	Handle app recovery differently for AM failures and RM restart	Resolved	Unassigned
28.	Recovery issues on RM Restart with FileSystemRMStateStore	Resolved	Karthik Kambatla
29.	Populate AMRMTokens back to AMRMTokenSecretManager after RM restarts	Closed	Jian He
30.	RMStateStore should flush all pending store events before closing	Closed	Jian He
31.	RM may relaunch already KILLED / FAILED jobs after RM restarts	Resolved	Jian He
32.	AM fails to register if RM restarts within 5s of job submission	Resolved	Unassigned
33.	During RM restart, RM should start a new attempt only when previous attempt exits for real	Closed	Omkar Vinit Joshi
34.	Register ClientToken MasterKey in SecretManager after it is saved	Closed	Jian He
35.	Save version information in the state store	Closed	Jian He
36.	FileSystemRMStateStore can leave partial files that prevent subsequent recovery	Closed	Omkar Vinit Joshi
37.	Rethink znode structure for RM HA	Closed	Tsuyoshi Ozawa
38.	Batching optimization for ZKRMStateStore	Resolved	Tsuyoshi Ozawa
39.	Implement a RMStateStore cleaner for deleting application/attempt info	Closed	Jian He
40.	RM hangs on shutdown if calling system.exit in serviceInit or serviceStart	Closed	Jian He
41.	Check time cost for recovering max-app-limit applications	Resolved	Jian He
42.	Change killing application to wait until state store is done	Closed	Jian He
43.	Apps should be saved after it's accepted by the scheduler	Open	Jian He
44.	Fix invalid RMApp transition from NEW to FINAL_SAVING	Closed	Karthik Kambatla
45.	Revisit RMApp transitions from NEW on RECOVER	Resolved	Unassigned
46.	Execessive logging for app and attempts on RM recovery	Open	Unassigned
47.	Work preserving recovery of Unmanged AMs	Resolved	Subramaniam Krishnan
48.	NPE on registerNodeManager if the request has containers for UnmanagedAMs	Closed	Karthik Kambatla
49.	Job stays in PREP state for long time after RM Restarts	Closed	Jian He
50.	Succeeded application remains in accepted after RM restart	Closed	Jian He
51.	Better reporting of finished containers to AMs	Resolved	Unassigned
52.	RM should honor NM heartbeat expiry after RM restart	Open	Unassigned
53.	Move RM recovery related proto to yarn_server_resourcemanager_recovery.proto	Closed	Tsuyoshi Ozawa
54.	Remove ApplicationAttemptState and ApplicationState class in RMStateStore class	Closed	Tsuyoshi Ozawa
55.	Add leveldb-based implementation for RMStateStore	Closed	Jason Darrell Lowe
56.	RMProxy should retry EOFException	Closed	Jian He

Activity

People

Assignee:: Unassigned

Reporter:: Arun Murthy

Votes:: 1 Vote for this issue

Watchers:: 69 Start watching this issue

Dates

Created:: 08/Jun/12 04:05

Updated:: 08/May/15 18:07

Resolved:: 03/May/15 02:29