Details
- Type: Sub-task
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 2.3.0
- None
Description
To support work-preserving NM restart, we need to recover the state of the containers that existed when the nodemanager went down. This includes informing the RM of containers that exited in the interim, a strategy for handling the exit codes of those containers, and a way to reacquire the still-active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered.
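The recovery decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NodeManager implementation: the class `ContainerRecoverySketch`, the `Record` and `Plan` types, the `RecordedState` values, and the `EXIT_CODE_LOST` placeholder are all hypothetical names, and the set of still-running container IDs is assumed to be supplied by some liveness check that is outside the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ContainerRecoverySketch {
    // Hypothetical persisted states; not the actual NM state store schema.
    enum RecordedState { LAUNCHED, COMPLETED }

    // What a hypothetical state store remembers about one container.
    static final class Record {
        final RecordedState state;
        final Integer exitCode; // null if no exit code was persisted

        Record(RecordedState state, Integer exitCode) {
            this.state = state;
            this.exitCode = exitCode;
        }
    }

    // Result of recovery: what to tell the RM and what to keep monitoring.
    static final class Plan {
        final Map<String, Integer> reportToRM = new LinkedHashMap<>();
        final List<String> reacquire = new ArrayList<>();
    }

    // Hypothetical placeholder for containers whose exit code was lost
    // because they exited while the nodemanager was down.
    static final int EXIT_CODE_LOST = -1000;

    static Plan buildPlan(Map<String, Record> recovered, Set<String> stillAlive) {
        Plan plan = new Plan();
        for (Map.Entry<String, Record> entry : recovered.entrySet()) {
            String id = entry.getKey();
            Record record = entry.getValue();
            boolean running = record.state == RecordedState.LAUNCHED
                    && stillAlive.contains(id);
            if (running) {
                // Active container: reacquire it and wait for its exit code.
                plan.reacquire.add(id);
            } else {
                // Finished before or during the outage: report to the RM,
                // falling back to the placeholder if no exit code survived.
                int code = record.exitCode != null ? record.exitCode : EXIT_CODE_LOST;
                plan.reportToRM.put(id, code);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Record> recovered = new LinkedHashMap<>();
        recovered.put("container_01", new Record(RecordedState.COMPLETED, 0));
        recovered.put("container_02", new Record(RecordedState.LAUNCHED, null));
        recovered.put("container_03", new Record(RecordedState.LAUNCHED, null));

        Plan plan = buildPlan(recovered, Collections.singleton("container_02"));
        System.out.println("report to RM: " + plan.reportToRM);
        System.out.println("reacquire:    " + plan.reacquire);
    }
}
```

In this sketch a container that was recorded as launched but is no longer alive is treated as having exited in the interim; how its real exit code might be recovered is exactly the open question the description raises, so the placeholder value stands in for that strategy.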
Attachments
Issue Links
- contains
  - YARN-1352 Recover LogAggregationService upon nodemanager restart (Resolved)
- is blocked by
  - YARN-1338 Recover localized resource cache state upon nodemanager restart (Closed)
  - YARN-1354 Recover applications upon nodemanager restart (Closed)
- is duplicated by
  - YARN-2040 Recover information about finished containers (Resolved)
- is related to
  - YARN-2402 NM restart: Container recovery for Windows (In Progress)
  - YARN-2561 MR job client cannot reconnect to AM after NM restart. (Closed)