Hadoop YARN / YARN-556

[Umbrella] RM Restart phase 2 - Work preserving restart

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None
    • Target Version/s: None

      Description

      YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts.

      Attachments

      1. Work Preserving RM Restart.pdf
        202 kB
        Bikas Saha
      2. WorkPreservingRestartPrototype.001.patch
        88 kB
        Anubhav Dhoot
      3. YARN-1372.prelim.patch
        53 kB
        Anubhav Dhoot


          Activity

          Vinod Kumar Vavilapalli added a comment -

          Makes sense. Resolved as fixed. Keeping it unassigned given multiple contributors. No fix-version given the tasks spanned across releases.

          Bikas Saha added a comment -

          Jian He, Anubhav Dhoot, Karthik Kambatla, Vinod Kumar Vavilapalli: should we resolve this jira as complete?

          Santosh Marella added a comment -

          Referencing YARN-2476 here to ensure the specific scenario mentioned there is fixed as part of this JIRA.

          Anubhav Dhoot added a comment -

          The NM does not remove completedContainers from its list until the RM sends a new field in the NodeHeartbeatResponse that tracks the container completions acked by the AM.
          The RM AppAttempt tracks completed containers by node id. These are sent to the AM, and the AM's next allocate call is assumed to implicitly ack the previous response. RMNode gets a new event to process this ack and sends it to the NM via the heartbeat response, completing the cycle.
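          A minimal, self-contained sketch of the ack cycle described above; the class and method names are invented for illustration and are not the actual NodeManager code.

            import java.util.ArrayList;
            import java.util.Collection;
            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Sketch of the NM-side bookkeeping: completed containers are reported on
            // every heartbeat until the RM says the AM has acked them.
            class CompletedContainerLedger {
              private final Map<String, String> pendingCompletions = new ConcurrentHashMap<>();

              // Called when a container finishes on this NM.
              void recordCompletion(String containerId, String diagnostics) {
                pendingCompletions.put(containerId, diagnostics);
              }

              // Included in each NM -> RM heartbeat, so a restarted RM cannot miss it.
              Collection<String> completionsToReport() {
                return new ArrayList<>(pendingCompletions.keySet());
              }

              // Driven by the new "acked by AM" field in the heartbeat response;
              // only now is it safe for the NM to forget the completion.
              void onHeartbeatResponse(Collection<String> containersAckedByAM) {
                pendingCompletions.keySet().removeAll(containersAckedByAM);
              }
            }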

          Tsuyoshi Ozawa added a comment -
          > Oh. Forgot to mention that. Anubhav Dhoot offered to split up the prototype into multiple patches, one for each of the sub-tasks. If I understand right, his prototype covers almost all the sub-tasks already created.

          Anubhav Dhoot, thanks for your great work. I noticed that you attached a patch on YARN-1367. I'll comment there about the patch.

          Tsuyoshi Ozawa added a comment -

          Good point, Bikas. Created YARN-2052 for tracking container id discussion. Anubhav Dhoot, let's discuss there.

          Bikas Saha added a comment -

          Folks please take the discussion for container id to its own jira. Spreading it in the main jira will make it harder to track.

          Bikas Saha added a comment -

          > After the configurable wait-time, the RM starts accepting RPCs from both new AMs and already existing AMs.

          This is not needed. The AM can be allowed to re-sync after state is recovered from the store. Allocations to the AM may not occur until the threshold elapses. In fact, we want to re-sync the AMs ASAP so that they don't give up on the RM.

          > Existing AMs are expected to resync with the RM, which essentially translates to register followed by an allocate call

          We should keep the option open to use a new API called resync that does exactly that. It may help to make this operation "atomic".
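          A hypothetical sketch of such a resync API - YARN does not define this RPC, and the type and field names are invented - showing how "register + first allocate" could be folded into one atomic call:

            import java.util.List;

            interface ApplicationMasterResyncProtocol {

              // Register fields plus the full set of outstanding asks, sent together so a
              // restarted RM can rebuild its pending-request state in a single step.
              final class ResyncRequest {
                String amHost;
                int amRpcPort;
                String trackingUrl;
                List<String> outstandingResourceRequests; // simplified stand-in for real asks
              }

              // What the AM needs back: the containers the RM believes are still running.
              final class ResyncResponse {
                List<String> containersFromPreviousRegistration;
              }

              ResyncResponse resync(ResyncRequest request);
            }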

          Tsuyoshi Ozawa added a comment -

          If we can break compatibility of the container id, I think Anubhav's approach has no problem.
          If we cannot do this, as Jian He mentioned on YARN-2001, I think the epoch idea described here might be used.

          Anubhav Dhoot added a comment -

          > clustertimestamp is added to containerId so that containerIds after an RM restart do not clash with containerIds from before (as the containerId counter resets to zero in memory).

          The problem is that containerId currently is composed of ApplicationAttemptId + int. The int part comes from an in-memory containerIdCounter in AppSchedulingInfo, which gets reset after an RM restart. Without any changes, the containerIds for containers allocated after restart would clash with existing containerIds.
          The prototype's proposal is to make it ApplicationAttemptId + uniqueid + int, where the uniqueid can be a timestamp set by the RM. I feel containerId should be an opaque string that YARN app developers don't take a dependency on. Also, if we used protobuf serialization/deserialization rules everywhere, we could deal with compatibility changes across different YARN code versions.
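          To make the "uniqueid" idea concrete, here is a tiny, self-contained illustration (not the actual YARN implementation) of how folding a per-restart value into the id keeps container ids unique even though the sequence counter resets to zero:

            // Demo: pack a restart epoch (persisted counter or timestamp-derived value)
            // into the high bits of a 64-bit id, and the per-attempt sequence number
            // into the low bits.
            public class EpochContainerIdDemo {
              static long containerId(long epoch, long sequence) {
                return (epoch << 40) | (sequence & ((1L << 40) - 1));
              }

              public static void main(String[] args) {
                long beforeRestart = containerId(0, 7); // 7th container before the restart
                long afterRestart  = containerId(1, 7); // counter reset, but epoch bumped
                System.out.println(beforeRestart != afterRestart); // true: no clash
              }
            }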

          Karthik Kambatla (Inactive) added a comment -

          Oh. Forgot to mention that. Anubhav Dhoot offered to split up the prototype into multiple patches, one for each of the sub-tasks. If I understand right, his prototype covers almost all the sub-tasks already created.

          Vinod Kumar Vavilapalli added a comment -

          Also, if there is a general agreement on how patches should go in which order, please create that ordering through JIRA dependencies. Thanks.

          Vinod Kumar Vavilapalli added a comment -

          Tx for the community update, Karthik.

          Also, Jian/Anubhav, can you both please file all the known sub-tasks and assign to yourselves the ones you are working on right away? Other folks like Tsuyoshi Ozawa and Rohith have repeatedly expressed interest in working on this feature. It'll be great to find stuff for everyone instead of creating all the tickets and assigning them to the two of you. Thanks.

          Tsuyoshi Ozawa and Rohith, let others know what you specifically want to work on, if you have something in mind.

          > 6. clustertimestamp is added to containerId so that containerIds after an RM restart do not clash with containerIds from before (as the containerId counter resets to zero in memory)

          I totally missed this line item. Can you give more detail on what the problem is and what the proposal is? What is done in the prototype patch is a major compatibility issue - I'd like to avoid it if we can.

          Karthik Kambatla (Inactive) added a comment -

          For the scheduler-related work itself, the offline sync-up concluded it would be best to move as much common code as possible to AbstractYarnScheduler. To unblock the restart work at the earliest, we should do it in two phases - a first phase that only pulls out the pieces that make it easier to handle recovery, and a more comprehensive re-jig later.

          Karthik Kambatla (Inactive) added a comment -

          Anubhav, Bikas, Jian, Vinod and myself synced up offline to discuss some of the action items. The general feedback was that the prototype is mostly good, except for how we want to implement the scheduler changes.

          The control flow of the RM restart recovery process will look something like this. Please feel free to question the correctness.

          1. RM loads the state from the state store
          2. RM starts all its internal services.
          3. RM starts the scheduler, so it could reconstruct the state from node heartbeats.
          4. RM waits for a configurable period of time to allow a sufficient fraction of nodes to rejoin. Individual schedulers decide on how to handle nodes rejoining after this configurable period and before they expire. The options are to kill containers running on the nodes that came in late or to allow them to continue running and update scheduler data-structures. It sounded like CapacityScheduler would prefer the former and FairScheduler the latter - this can still change later.
          5. After the configurable wait-time, the RM starts accepting RPCs from both new AMs and already existing AMs.
          6. Existing AMs are expected to resync with the RM, which essentially translates to a register followed by an allocate call that sends all the outstanding requests. AMRMClient will handle all of this, so user AMs that use it benefit automatically.

          Other related items:

          1. Currently, NM information is not persisted. We should persist it and reload on restart.
          2. Update the javadoc for RegisterApplicationMasterResponse#getContainersFromPreviousAttempts to say it only gives information as of that snapshot. Info on running containers might keep trickling in even after the register - we might need something similar in AllocateResponse to get this information. We might also want to add a degree of confidence to this response, i.e. the fraction of nodes that have reconnected.
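          The numbered sequence above, condensed into a self-contained sketch; all class and method names are invented stand-ins rather than the real ResourceManager internals, and the step comments refer to the list in the comment:

            class WorkPreservingRecoverySketch {
              interface StateStore { Object loadState(); }

              private final StateStore store;
              private final long waitForNodesMs;

              WorkPreservingRecoverySketch(StateStore store, long waitForNodesMs) {
                this.store = store;
                this.waitForNodesMs = waitForNodesMs;
              }

              void recover() throws InterruptedException {
                Object persistedApps = store.loadState(); // 1. load app/attempt state from the store
                startInternalServices();                  // 2. bring up the RM's internal services
                startScheduler(persistedApps);            // 3. scheduler ready to rebuild allocations
                                                          //    from node heartbeats
                Thread.sleep(waitForNodesMs);             // 4. wait a configurable period for a
                                                          //    sufficient fraction of NMs to rejoin
                acceptApplicationMasterRpcs();            // 5. accept RPCs from new and existing AMs
                // 6. existing AMs resync: register again, then allocate with all outstanding
                //    requests (AMRMClient handles this for AMs that use it).
              }

              private void startInternalServices() { }
              private void startScheduler(Object persistedApps) { }
              private void acceptApplicationMasterRpcs() { }
            }
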
          Jian He added a comment -

          Hi Anubhav,
          Looked at the prototype patch. Regarding the approach, it's better to have a scheduler-agnostic recovery mechanism with no or minimal scheduler-specific changes, instead of implementing recovery separately in each scheduler. YARN-1368 can be renamed to accommodate the necessary common changes for all schedulers. Also, adding the cluster timestamp to the container id doesn't seem right, and it'll also break compatibility.

          Tsuyoshi Ozawa added a comment -

          Anubhav Dhoot, I glanced over your patch.

          1. Can you split your code into a patch for each subtask? Your patch includes the overall changes for this task, and we should discuss the finer points on each subtask JIRA.
          2. IMO, the prototype is enough to validate the design. Do you have any additional comments about the design doc?

          I'd like to include this feature in 2.5.0 (maybe May - June?), so let's work together.

          Tsuyoshi Ozawa added a comment -

          Anubhav, Thank you for sharing the prototype. I will try it this weekend.

          Anubhav Dhoot added a comment -

          This prototype is a way to understand the overall design, the major issues that need to be addressed, and the minor details that crop up.
          It is not a substitute for actual code and unit tests for each sub-task.
          Hopefully it will help drive a discussion on the overall approach and on each sub-task.

          In this prototype, the following changes are demonstrated.

          1. Containers that were running when the RM restarted will continue running.
          2. The NM on resync sends the list of running containers as ContainerReports so they provide the container capabilities (sizes).
          3. The AM on resync re-registers instead of shutting down. The AM can make further requests after the RM restart and they are accepted.
          4. A sample of the scheduler changes in FairScheduler. It re-registers the application attempt on recovery. On NM addNode it adds the containers to that application attempt and charges them correctly to the attempt for usage tracking.
          5. Applications and Containers resume their lifecycle, with additional transitions to support continuation after recovery.
          6. clustertimestamp is added to containerId so that containerIds after an RM restart do not clash with containerIds from before (as the containerId counter resets to zero in memory).
          7. The changes are controlled by a flag.

          Topics not addressed:
          1. Key and token changes.
          2. The AM does not yet resend requests made before the restart. So if the RM restarts after the AM has made its request and before the RM returns a container, the AM is left waiting for an allocation. Only new asks made after the RM restart work.
          3. Completed container status, as per the design, is not handled yet.

          Readme for running through the prototype

          a) Set up with RM recovery turned on and the scheduler set to FairScheduler.
          b) Start a sleep job with map and reduce tasks, such as:
          bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar sleep -mt 12000 -rt 12000
          c) Restart the RM (yarn-daemon.sh stop/start resourcemanager) and see that containers are not restarted.
          The following 2 scenarios work:
          1. Restart the RM while the reduce is running. The reduce continues and then the application completes successfully. This demonstrates continuation of running containers without restart.
          2. Restart the RM while the map is running. The map continues, then the reduce executes, and then the application completes successfully. This demonstrates that requesting more resources after the restart works, in addition to the previous scenario.
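          Step (a) above, expressed as a minimal configuration sketch. The property names assume the standard RM recovery and scheduler keys, and the file-system state store is only one possible choice; on a real cluster the same keys would go into yarn-site.xml.

            import org.apache.hadoop.conf.Configuration;

            public class RecoverySetup {
              public static Configuration recoveryEnabledFairSchedulerConf() {
                Configuration conf = new Configuration();
                // Turn on RM recovery and give it a state store to recover from.
                conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
                conf.set("yarn.resourcemanager.store.class",
                    "org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore");
                // Use the FairScheduler, as the prototype's scheduler changes target it.
                conf.set("yarn.resourcemanager.scheduler.class",
                    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
                return conf;
              }
            }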

          Karthik Kambatla (Inactive) added a comment -

          We think the prototype would be a validation of the design. Individual sub-tasks will go through the same rigor of unit tests and code review. It would help to add further details to the design or evaluate any minor changes required before committing the sub-tasks.

          Vinod Kumar Vavilapalli added a comment -

          I don't see the value of a prototype given we have a mostly concrete design. It's fine to do it, but let's make sure we are not taking shortcuts in the interest of getting a quick & dirty version up.

          Karthik Kambatla (Inactive) added a comment -

          > Please align with the design doc while prototyping. If the design needs changes then please update the document. The sub-tasks need to follow the design doc so that other folks can follow even if they are not writing the code.

          Yes, that is the idea. The prototype should be mostly ready by the end of the week. Will update the document with any minor changes we see are required, along with the prototype.

          > The scheduler changes are the most complex piece. But they can come in the end.

          Without the scheduler changes, I am concerned the remaining patches would only break things. The alternative is to have a config to enable work-preserving restart and guard all changes by that config - I am not yet fully convinced of this approach; would we want to leave this config in even after the feature is complete?

          Bikas Saha added a comment -

          Please align with the design doc while prototyping. If the design needs changes then please update the document. The sub-tasks need to follow the design doc so that other folks can follow even if they are not writing the code.

          Some pieces of this are already underway in trunk (e.g. the RM not killing containers on app attempt exit). The scheduler changes are the most complex piece, but they can come in the end. Working on trunk allows breaks/bugs to be caught quicker and forces us to be more methodical in our approach. A branch is useful when it's not clear what approach to take or when we know the code is going to be broken across commits. So I would prefer we do this on trunk.

          Tsuyoshi Ozawa added a comment -

          Jian He, your approach looks good to me. We can test new features with the updated protocol. On the NM side, we can choose to switch the NM resync on/off via configuration.

          Karthik Kambatla and Anubhav Dhoot, can you attach the prototype source code to the JIRAs? I'd like to contribute to this JIRA and work with you.

          Jian He added a comment -

          Or we can work from a branch first and then move to trunk once it's in good shape.

          Jian He added a comment -

          IMO, I would prefer to work on the protocol changes first; the RM can choose to ignore the container status reports for the time being. It's not possible to test on a real cluster if we make scheduler changes only, since there are no real entities to report the container statuses. If possible, I'd like this to happen on trunk: since this can be deeply coupled inside the RM, we can catch bugs as early as possible and also avoid a merge nightmare. Thoughts?

          Karthik Kambatla (Inactive) added a comment -

          Thanks for posting the design doc, Bikas Saha. Anubhav Dhoot and I have been working on this for the past few days towards an initial prototype, so we get a handle on all the items required.

          In terms of actual work-items (JIRAs), I wonder if it makes sense to work in a branch. Making the AM and NM resync changes without the scheduler changes would break things. We can work on the scheduler changes first, so there is no caller yet, and add the resync later; but I suppose that would make it hard to test outside of unit tests.

          Thoughts?

          Bikas Saha added a comment -

          Added some coarse grained tasks based on the attached proposal. More tasks may be added as details get dissected.

          Tsuyoshi Ozawa added a comment -

          Good news, thank you for sharing.

          Bikas Saha added a comment -

          Thanks for the reminder. Based on the attached proposal, I am going to create sub-tasks of this jira. Contributors are free to pick up those tasks.

          Tsuyoshi Ozawa added a comment -

          Hi Bikas, can you share the current state about this JIRA?

          Bikas Saha added a comment -

          Attaching a proposal with details. I may have missed writing something even though I thought of it or may have missed something altogether. Will incorporate feedback as it comes. Will soon start creating sub-tasks that make sense in a chronological ordering of work. Making incremental progress while keeping the RM stable is the desired course of action (like YARN-128).

          Bikas Saha added a comment -

          I will be shortly posting a design/road-map document. If anyone has ideas/notes then please start posting them so that I can consolidate.

          Bikas Saha added a comment -

          We should be actively avoiding high volume writes to the state store. It makes everything simpler.

          Junping Du added a comment -

          Hi Bikas, besides resending resource requests through the AM-RM heartbeat, have we thought about some other way, like putting the resource requests into the ZK store?

          Bikas Saha added a comment -

          Adding brief description of proposal from YARN-128 design document for Google Summer of Code.

          Work preserving restart - the RM will have to make the RMAppAttempt state machine enter the Running state before starting the internal services. The RM can obtain information about all running containers from the NMs when the NMs heartbeat with it. This information can be used to repopulate the allocation state of the scheduler. When the running AMs heartbeat with the RM, the RM can ask them to resend their container requests so that the RM can repopulate all the pending requests. Repopulating the running container and pending container information completes all the data needed by the RM to start normal operations.
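          A self-contained sketch (not the real AMRMClient code) of the "resend container requests" part of the excerpt above: the client keeps a table of asks it has made but not yet seen satisfied, so after re-registering with a restarted RM it can replay them all in its next allocate call.

            import java.util.ArrayList;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;

            class ResyncingAmClient {
              // Key: a description of the ask (priority/location/size); value: containers still wanted.
              private final Map<String, Integer> outstandingAsks = new HashMap<>();

              void requestContainers(String askKey, int count) {
                outstandingAsks.merge(askKey, count, Integer::sum);
              }

              void onContainersAllocated(String askKey, int count) {
                // Satisfied asks drop out of the table; anything left is still pending.
                outstandingAsks.computeIfPresent(askKey, (k, v) -> v > count ? v - count : null);
              }

              // Called when the RM signals that it restarted and lost its in-memory state.
              List<String> buildResyncAllocate() {
                List<String> asksToReplay = new ArrayList<>();
                outstandingAsks.forEach((ask, n) -> {
                  for (int i = 0; i < n; i++) {
                    asksToReplay.add(ask);
                  }
                });
                return asksToReplay; // sent right after re-registering with the restarted RM
              }
            }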

            People

            • Assignee: Unassigned
            • Reporter: Bikas Saha
            • Votes: 0
            • Watchers: 43
