Hadoop YARN
YARN-291

[Umbrella] Dynamic resource configuration

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager, scheduler
    • Labels:

      Description

      The current Hadoop YARN resource management logic assumes that per-node resources are static during the lifetime of the NM process. Allowing run-time configuration of per-node resources will give us finer-grained resource elasticity. This allows Hadoop workloads to coexist efficiently with other workloads on the same hardware, whether or not the environment is virtualized. More background and design details can be found in the attached proposal.

      1. YARN-291-all-v1.patch
        66 kB
        Junping Du
      2. YARN-291-core-HeartBeatAndScheduler-01.patch
        31 kB
        Junping Du
      3. YARN-291-JMXInterfaceOnNM-02.patch
        4 kB
        Junping Du
      4. YARN-291-AddClientRMProtocolToSetNodeResource-03.patch
        25 kB
        Junping Du
      5. YARN-291-YARNClientCommandline-04.patch
        8 kB
        Junping Du
      6. YARN-291-OnlyUpdateWhenResourceChange-01-fix.patch
        6 kB
        Junping Du
      7. Elastic Resources for YARN-v0.2.pdf
        512 kB
        Junping Du
      8. YARN-291-CoreAndAdmin.patch
        46 kB
        Junping Du

        Issue Links

          Activity

          Vinod Kumar Vavilapalli added a comment -

          Interesting. I quickly read through your doc; can you please post the YARN-only design here?

          • +1 for the reactive resource change to start with.
          • Instead of exposing this via JMX, since it is a fairly fundamental feature, we should do RPC and optionally web services to the NM directly. And we should add a command-line util too.
          Junping Du added a comment -

          Thanks, Vinod, for the great comments. I will update the design doc to address them soon.

          Junping Du added a comment -

          Added a complete patch, which includes: the heartbeat and scheduler changes, a JMX interface with SetResource on the NodeManager, SetNodeResource added to ClientRMProtocol, and a command line for updating resources in NodeCLI.

          Junping Du added a comment -

          Basically, there are two interfaces for updating an NM's resources in the current implementation:

          • YARN RPC through ClientRMProtocol, to update the resource on the RM (RMNode and SchedulerNode) directly, without telling the NM. (In fact, this requires no heartbeat change.)
          • JMX on the NodeManager: the resource is updated on the NM and then propagated to the RM through the heartbeat.
            There are some other options as well:
            1) Add an NMAdminProtocol, so the NM has an RPC server that accepts resource updates from a local command line and forwards them to the RM through the heartbeat.
            2) Add an RMNMProtocol (with the NM as server), so a request through ClientRMProtocol (RM) can go to the NM (also no heartbeat change needed).
            3) Add a REST API on the NM with a put() method for resource changes.
            1) and 2) seem heavy to me; 3) seems like reasonable work for a next step? (A rough sketch of the JMX option follows below.)
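            For illustration of the JMX option only, here is a minimal sketch; the MBean name, method signatures and registration below are hypothetical and are not taken from the actual patch:

                // Hypothetical MBean the NodeManager could expose for run-time resource updates.
                public interface NodeResourceMXBean {
                  int getMemoryMB();                            // currently advertised memory, in MB
                  int getVirtualCores();                        // currently advertised vcores
                  void setResource(int memoryMB, int vCores);   // new totals, picked up by the next NM-RM heartbeat
                }

                // Hypothetical registration inside the NM:
                // ManagementFactory.getPlatformMBeanServer().registerMBean(
                //     nodeResourceManagerImpl,
                //     new ObjectName("Hadoop:service=NodeManager,name=NodeResource"));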
          Junping Du added a comment -

          Divided the patch into 4 patches (01-04) for easier review:

          • Patch 01 contains most of the heartbeat and scheduler changes
          • Patch 02 is the JMX implementation on the NodeManager to set/get the node's resource
          • Patch 03 adds an interface through ClientRMProtocol to set the node's resource
          • Patch 04 adds a CLI based on ClientRMProtocol to get/set the node's resource.
            Please note that 01 can work independently with 02, or with 03 and 04 (the heartbeat change needs to be removed).
            I will add more unit tests in each sub-patch later.
          Junping Du added a comment -

          Added the 01-fix patch so that the resource field in the NM-RM heartbeat is optional and only sent when the resource has changed. With this fix, patch 01 can work with patch 02 and patches 03/04 at the same time.

          Junping Du added a comment -

          Updated the proposal (previously attached to HADOOP-9165) to include only the YARN part, and added implementation details synced up with the recent patches.

          Bikas Saha added a comment -

          Junping, it would be helpful if you attach the patches all together when you are looking for review comments, and leave a comment saying how to go about reviewing the patches, e.g. the order in which to read them and what they contain. If there is a 1-1 mapping between the doc and your patches then the doc is sufficient. If not, then a comment explaining the patches would be great.

          Luke Lu added a comment -

          A simpler and more straightforward alternative would be changing the node resource on the RM directly via JMX. No protocol changes would be necessary. Changing the resource on the NM and propagating the change to the RM would lead to nasty race conditions where the RM still thinks that an NM has enough resource and schedules new containers on the already downsized NM; those containers should fail, and the RM would need to reschedule them elsewhere, which is not pretty, especially when a large number of NMs are affected, even if all the corner cases are handled.

          we should do RPC and optionally the web-services to the NM directly. And we should add command line util too.

          Since the only near term user of the feature is external admin processes that prefer not having Hadoop jar dependencies, JMX change alone (a much smaller patch that doesn't impact major production code paths) would suffice. YARN RPC changes can be in a separate JIRA, when YARN itself needs to change per node resources. I'm not sure how useful a YARN CLI interface for this is, given it's impractical for people to manually change per node resources (when number of nodes is greater than 10). OTOH, I'm fine with having everything in, as long as it doesn't delay the inclusion of the basic functionality.

          Junping Du added a comment -

          Hi Bikas, thanks for your comments. The biggest patch (YARN-291-all.patch) is the one with all patches together. The doc I attached recently has an implementation part with a 1-1 mapping to the sub-patches (01, 01-fix, 02, 03, 04). Hope this will be helpful for review. Thanks!

          Junping Du added a comment -

          Luke, thanks for the comments. I agree that changing the node resource on the RM directly via JMX is simpler, as there are no protocol (NM-RM heartbeat or Client-RM) changes. However, I don't understand your comment below:
          "Changing resource on NM and propagating the change to RM would lead to nasty race conditions where RM still thinks that an NM has enough resource and schedule new containers on the already downsized NM, which should fail and RM would need to reschedule the containers elsewhere"
          Since the RM's scheduling event is triggered by the NM heartbeat (NM -> RM's ResourceTrackerService -> RMNode (via RMNodeStatusEvent) -> Scheduler (via NodeUpdateSchedulerEvent)), I think that whether the change is made on the NM or on the RM, the resource update takes effect for container assignment on the next NM-RM heartbeat, so I don't see a race condition here.
          I think we can allow resource changes on both the RM and the NM, but they have to be consistent after exchanging a heartbeat. Allowing resource changes on the RM (whether via JMX or RPC; in fact, RPC is not too complicated, as patch 03 shows), like you said, is simple and straightforward. On the other hand, I see there is already a dummy NodeResourceMonitor in the NodeManager, and I remember there is a JIRA (I forget the number) about detecting the OS's resources rather than using static configuration. So I think allowing resource changes on the NM is also a good direction (at least a heartbeat carrying resource info can benefit other work). Thoughts?
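          For illustration of the flow described above (the handler, field and method names below are hypothetical simplifications, not the actual YARN classes):

              // Whether the new total was set on the NM (and carried in the heartbeat) or set
              // directly on the RM, the scheduler only sees it when the node-update event fires.
              void onNodeHeartbeat(NodeStatus status) {                  // hypothetical handler
                if (status.getTotalResource() != null) {                 // resource only sent when changed (01-fix)
                  rmNode.setTotalCapability(status.getTotalResource());  // hypothetical setter on the RM-side node
                }
                dispatcher.handle(new NodeUpdateSchedulerEvent(rmNode)); // scheduler recomputes available = total - allocated
              }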

          Luke Lu added a comment -

          Regarding the race condition, I meant that you could change the resource on NM in the middle of a heartbeat, where RM already assigned some containers to NM from the last heartbeat but you don't actually have the resource to launch these containers. It's relatively harmless now as NM is not enforcing its resource limit when launching containers, but the race condition can rear its ugly head if someone later decides to add code to enforce the limit when adding more resource features for things like GPU etc. The race condition would not be obvious to later maintainers.

          With the RM-only approach, I think that it's OK/harmless not to change totalResource at all on the NM side and leave it as the original resource limit, as long as we set the resource limit <= the original limit, even if limit enforcement is added later. This makes the change an order of magnitude smaller and less invasive, as it doesn't have to change the node update code paths. It would also be wasteful to heartbeat the node resource (especially when later enhanced to include features besides memory) when it doesn't change most of the time.

          Junping Du added a comment -

          Thanks, Luke, for the great comments! +1 for going with the RM-only approach first; we can go for the NM approach later.
          Incorporating Bikas's comments, I will file several sub-JIRAs (JMX, RPC, CLI) to make this clear.

          Junping Du added a comment -

          YARN-311 is ready for review. Thanks!

          Arun C Murthy added a comment -

          I commented on one of the sub-tasks, but I'm not comfortable with JMX based tricks. We fundamentally need an explicit manner to inform RM of changes to a node and that needs to flow down to schedulers etc.

          Luke Lu added a comment -

          Resource scheduling is fundamentally centralized at the RM. The global resource view is currently bootstrapped via the node registration process, which is more of a historical artifact based on convenience, since the resource view can also be constructed directly on the RM via an inventory database. It is a roundabout, inconvenient and inefficient way to (re)construct the resource view by modifying per-node config explicitly and propagating partial views to the RM, if you already have an inventory database.

          For a practical example: if you have an OLTP workload (say HBase) sharing the same hardware with YARN and there is a load surge on HBase, we need to stop scheduling tasks/containers immediately on the relevant (potentially all) nodes. The current patch (JMX is just used as a portable protocol for an external management client to communicate with the RM) can take effect immediately and most efficiently. If we explicitly modify each NodeManager's config and let the NM-RM protocol propagate the change to the RM, it would waste resources (CPU and network bandwidth) to contact (potentially) all the NodeManagers, cause unnecessary scheduling delays if the propagation is via the regular heartbeat, and/or DDoS the RM if (potentially) all the NMs need to re-register out of band.

          This is not about JMX-based tricks. This is about changing the global resource view directly where the scheduler is, versus the Rube-Goldbergish way of changing NM config individually and propagating the changes to the RM to reconstruct the resource view. IMO, the direct way is better because the NM doesn't really care about what resource it really has.

          Bikas Saha added a comment -

          because NM doesn't really care about what resource it really has

          I don't quite agree with this. The NM is the node controller and is responsible for managing the node's resources.

          RM-NM information exchange is a core concept in YARN. I don't think I agree with updating the RM directly and not bothering about what the NMs think or report about their resources. My concern is about the NM and RM going out of sync and YARN becoming harder to understand.

          Since the only near term user of the feature is external admin processes that prefer not having Hadoop jar dependencies

          It would help if some concrete scenarios were described. From the document, I get the general notion of cloud services having elastic nodes. If the cloud actually changes the node resources then the NMs should get to know that something has changed and report it. The reactive race condition is always there irrespective of whether the NMs report the change or some external entity does. NMs sync with the RM every 1 second. Is that duration too short for the use cases being targeted? The argument in favor of directly changing the RM view makes the case for almost all nodes changing state at once in the cluster. Is that the primary scenario?

          The HBase scenario mentioned in the comments needs more thought. Simply updating the RM with lower NM resource info may not be enough. True, the updated RM will not schedule more work on them. But the point is that if it had more work to schedule and there were free resources then it would have already done so. If resources are free then it means the RM does not have work for them. If resources are not free then the RM will not schedule more work until the NM heartbeat tells the RM about freshly freed-up resources. So I don't see how updating the RM out of band via admin will help.
          Also, the currently running tasks will hamper the HBase case even if future work is not scheduled. What we need is some way to allow HBase higher priority for system resources so that tasks do not impede HBase perf. Something like YARN-443.

          This JIRA is part of HADOOP-9165 but that one has been resolved as invalid.

          Luke Lu added a comment -

          The NM is the node controller and responsible for managing the node resources.

          There is actually no/zero code in the NM itself (other than picking up the resource config and reporting it to the RM) that cares about how much resource the node has. The NM is responsible for managing the containers that the scheduler assigns to it. It does not need to know how much resource it really has. Only the scheduler needs to care about how much resource a node has. Making the NM oblivious to its total resource obviates the unnecessary split-brain situation.

          if it had more work to schedule and there were free resources then it would have already done so

          The point was about scheduling latency. There could be a window during an OLTP surge where the nodes have free resources and a large job is submitted. OLTP workloads are real time. Also, contacting NMs to reconfigure resources node by node is inefficient and wasteful.

          the currently running tasks will hamper the HBase case even if future work is not scheduled.

          This is orthogonal to scheduling, which is what this JIRA is about. A management agent can suspend/move/kill the containers if necessary.

          What we need is some way to allow HBase higher priority for system resources so that tasks do not impede HBase perf. Something like YARN-443.

          OS scheduling priority works for CPU and, to a degree, IO (depending on the IO scheduler), but not memory, which is critical for many OLTP workloads.

          This JIRA is part of HADOOP-9165 but that one has been resolved as invalid.

          I don't know what you are trying to say. HADOOP-9165 was resolved due to the wrong JIRA component. It could have been moved to YARN or MAPREDUCE, but the creator chose to file new ones instead.

          Alejandro Abdelnur added a comment -

          +1 on the idea of dynamic resource configuration. The sharing w/HBase use case is a scenario we see quite often.

          Though, I have concerns about changing the available NM capacity directly in the NMs. This would significantly complicate allocation changes/re-computation of resources for running applications.

          NM available-resource discovery should be static, as it is completely acceptable to restart NMs without compromising running apps:

          • An NM on startup detects the OS's CPU/memory and reports this to the RM
            • config properties in the NM, if present, could override the OS-detected values
          • An NM should be restarted to report a new configuration

          To achieve dynamic resource configuration we should fully leverage the YARN resource scheduling framework, for example by introducing the concept of 'unmanaged resources'. The idea would be something like this:

          • An (unmanaged) AM requests unmanaged resources from the RM (a new flag in a resource request)
          • The RM assigns the unmanaged resource to a NM
          • The RM notifies the NM of the unmanaged resource assignment
          • The RM notifies the AM of the unmanaged resource assignment
          • The unmanaged assigned resource does not time out in the RM/scheduler due to container not starting (Bikas, our chat in the last yarn meet-up clicked)
          • The AM will, out of band, claim those resources from the corresponding node
          • The resources will remain assigned to the AM until the AM releases them, the AM goes MIA or the resources are preempted

          This, together with YARN-392 (enforce locality for a request), will allow an external component to claim cluster resources while enforcing existing YARN resource allocation policies and priorities.

          With this approach the changes in the RM/scheduler and API are quite simple (a rough sketch of the request-side flag follows below).

          Thoughts?
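          A rough sketch of the "new flag in a resource request" part of this proposal (purely hypothetical field and class names; no such API exists at this point):

              // Hypothetical, compatible extension of a resource request;
              // Resource here is the YARN record type (memory + vcores).
              public class ResourceRequestWithUnmanagedFlag {
                private Resource capability;   // e.g. <4096 MB, 2 vcores>
                private String resourceName;   // node/rack, combined with strict locality (YARN-392)
                private boolean unmanaged;     // true: RM reserves the capacity but launches no container;
                                               // the AM claims it out of band and releases it explicitly
                // getters/setters omitted
              }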

          Luke Lu added a comment -

          I have concerns on changing the available NMs capacity directly in the NMs.

          You probably haven't looked at the patch (YARN-311), which actually changes the global resource view on the RM directly. The NMs themselves are untouched. I changed the title to reduce further confusion.

          This would complicate significantly allocation changes/re-computation of resources for running applications.

          How so? In fact there is no need to change anything in NM, AFAICT. The existing containers can either be left alone until completion or suspended/moved/killed by a management agent.

          An (unmanaged) AM requests unmanaged resources from the RM (a new flag in a resource request)...

          First, this requires modifying existing applications to support the new request API, which has significant issues (best-effort, timeout) without preemption.

          Second, RM scheduling is fundamentally high-latency and best-effort by design, at container granularity. It's not suitable for low-latency workloads with high memory utilization via fine-grained (page-level) memory sharing. Using the HBase example again, you'll need to request an "unmanaged resource" with a container size of at least the peak memory of the region server JVM, which significantly under-utilizes memory, as peak memory size is significantly higher than average memory usage.

          Note, the existing simple patch doesn't change existing APIs/protocols. It works with any existing, unmodified OLTP application and can achieve maximum resource utilization with mainstream hypervisors.

          Alejandro Abdelnur added a comment -

          How so? In fact ...

          Changes in NM capacity triggered from outside of the regular scheduling would unbalance the existing distribution of allocations, potentially triggering preemption. You'd need special handling in the RM/scheduler for such scenarios.

          First this requires modifying existing applications to support new request API...

          Not really; the API would be augmented in a compatible way to support what I'm describing. Existing applications should not be affected, as they don't use this new functionality.

          Second, RM scheduling is fundamentally high latency ....

          It depends on how you design the AM that handles unmanaged containers. You could request several small resources at peak and then release them when you don't need them.

          Note, the existing simple patch doesn't change existing API/protocols.

          It is adding a new one; that is a change.

          Luke Lu added a comment -

          Changes in NM capacity triggered from outside of the regular scheduling would unbalance existing distribution of allocations potentially triggering preemption. You'd need to handle this specially in the RM/scheduler to handle such scenarios.

          The existing mechanism would/should work by simply killing off containers when necessary. The container fault-tolerance mechanism would/should take care of the rest (including preemption). We can do a better job of differentiating the faults induced by preemption, which would be straightforward if we expose a preemption API when we get around to implementing the preemption feature. If a container suspend/resume API is implemented, we can use that as well.

          It depends how you design you AM that handles unmanaged containers. You could request several small resources on peak and then release them as you don't need them.

          This requires many features currently missing from the RM in order to work properly: finer-grained OS/application resource metrics, application priority, conflict arbitration, preemption, and related security features (mostly authorization). This approach also makes it problematic to support coexistence of different instances/versions of YARN on the same physical cluster.

          It is adding a new one, that is a change.

          The change doesn't affect existing/future YARN applications. The management protocol allows existing/future cluster schedulers to expose appropriate resource views to (multiple instances/versions of) YARN in a straightforward manner.

          IMO, the solution is orthogonal to what you have proposed. It allows any existing non-YARN application to efficiently coexist with YARN applications without having to write a special AM using an "unmanaged resource" API, with no new features "required" in YARN now. In other words, it is a simple solution to allow YARN to coexist with other schedulers (including other instances/versions of YARN) that already have the features people use/want.

          I'd be interested in hearing cases, where our approach "breaks" YARN applications in any way.

          Alejandro Abdelnur added a comment -

          I'd be interested in hearing cases, where our approach "breaks" YARN applications in any way.

          Luke, Junping, it would be great if you verify this is not the case as part of your work/test.

          Junping Du added a comment -

          Linked to YARN-45 and MAPREDUCE-4584: in our next phase, we may leverage the new protocols to release containers when narrowing a node's resources.

          Junping Du added a comment -

          Hi Alejandro Abdelnur, thanks for the comments above. Actually, just as Luke said, in this JIRA we try to address the case where a YARN node's resources are changed by plan, i.e. when sharing resources with non-YARN applications (like HBase, Drill, etc.), not as a reaction to an app's short-term thrashing behavior. Thus, we don't have to change anything related to the app APIs; we just provide a way to change a node's resources through admin APIs (admin protocol, CLI, REST and JMX). The only special case is that we must handle running containers whose resources exceed the NM's new (reduced) total, but that can be handled well by allowing a negative value for available resource (call it a "debt" resource model), which shows the great flexibility of the YARN framework. The YARN scheduler simply stops assigning containers to an over-loaded NM (one "in debt") until its resources are balanced again, which worked well in our previous test (100 virtual nodes on 10 physical servers). In the long term, the preemption mechanism can also be involved to release containers/resources in some cases. Thoughts?
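          A minimal sketch of the "debt" accounting described above (hypothetical helper and method names, not the actual scheduler code):

              // Available capacity may go negative after an admin lowers the node's total;
              // the node is simply skipped until completed containers repay the debt.
              boolean tryAssign(SchedulerNodeView node, ContainerRequestView request) {  // hypothetical types
                int availableMB = node.getTotalMB() - node.getAllocatedMB();             // can be negative ("in debt")
                if (availableMB >= request.getMemoryMB()) {
                  assignContainer(node, request);
                  return true;
                }
                return false;  // over-committed or too small; the scheduler tries another node
              }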

          Alejandro Abdelnur added a comment -

          Junping Du, yes that should do it, at least in theory. We should make sure we have tests that exercise NM capabilities and that things don't break.

          One more thing (along the lines of a comment I made in YARN-796): NMs get/report their total capacity to the RM from the NM configuration. If you change the total capacity through the rmadmin API, are these changes transient? If yes, what happens when the NM restarts? If not, where do you persist these changes and how do you merge them back when the NM restarts?

          Junping Du added a comment -

          Sure. I will add more unit tests later to cover this case and make sure things don't break.
          Whether to persist a runtime capacity change depends on how we think about the life cycle of these changes. IMO, when an NM restarts (whether planned or not), it should be seen as a fresh node, as we are not sure the previous resource-setting condition (resource competition between different types of apps) is still there, so it is better to start with the original value. Let me know if you have different thoughts. Thanks!

          Alejandro Abdelnur added a comment -

          If the NM goes down, on restart the running containers may or may not be recovered; that is not the point I was trying to make. What I was trying to say is that if an NM's total capacity has been lowered at runtime by the proposed API and that capacity is being used by a non-YARN process (i.e. HBase), it can be the case that the taken-out capacity is still being used by the non-YARN service. You want that total-capacity correction to remain so the non-YARN process is not affected, no?

          Junping Du added a comment -

          I think we are just not sure whether the non-YARN process is still there after the NM comes back, as there are several different cases, aren't there? So only the admin (or cross-app management software) is aware of it and should take care of it. If the non-YARN process is still there, then the admin should set a proper capacity value in the NM config to reflect the resource situation. If not, then the previous dynamic change is not required any more.
          The principles here could be: whenever an NM is started, it is treated as a fresh node and its (static) resource configuration should reflect the current resource view of the node; and we allow dynamic changes during the node's runtime to adapt to resource changes. Based on these two rules, we can ensure a consistent resource view across NM registration and runtime.
          Any thoughts?

          Alejandro Abdelnur added a comment -

          Junping Du, on a host you would typically have an NM, a DN and an HBase region server. If the NM goes down, the DN and the region server can still be working just fine. If you change the NM total-capacity configuration via the admin API, the expectation would be that the change is remembered and used the next time the NM starts. This would ensure that the other services, the DN and the region server, do not suffer because the NM forgot its proper configs. No?

          Junping Du added a comment -

          Hi Alejandro Abdelnur, as we discussed at Hadoop Summit, I agree we should persist the NM's resource change in some cases, so I created sub-JIRA YARN-998 to address this. Thoughts?

          Arun C Murthy added a comment -

          Not sure where this is heading - I'd support admin APIs to the RM to change node configs (a la the existing "yarn rmadmin -refreshNodes").

          Alejandro Abdelnur added a comment -

          Are we talking about an admin call to the RM that would set a resource correction on a per-node basis, with the RM adjusting the NM-reported resource capacity based on this correction? This would not require changes in the NMs. Potentially the correction could be applied on the node update event before it makes it to the scheduler impl, and thus be transparent to the scheduler impl. And if we want to persist these corrections, this could be done by the RM itself.

          If I got things right I'm OK with the approach.
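          A sketch of that per-node correction idea (hypothetical names; not an actual RM code path):

              // The RM keeps an admin-set correction per node and applies it while handling the
              // node-update event, so the scheduler implementation never sees the raw NM value.
              void onNodeUpdate(NodeId nodeId, Resource reported) {       // reported = what the NM registered/heartbeated
                Resource correction = adminCorrections.get(nodeId);       // set via the admin API; may be null
                Resource effective = (correction != null) ? correction : reported;
                rmNode.setTotalCapability(effective);                     // and the RM could persist corrections itself
              }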

          Junping Du added a comment -

          Hi Arun C Murthy and Alejandro Abdelnur. Yes, an Admin API should be our target, and I have already updated YARN-313 away from the Node CLI. The above description is exactly what we are doing here. Will sync up the patches.

          Junping Du added a comment -

          Updated the patch against latest trunk with the core changes and admin protocol (previously the Node CLI). Will try more tests and add more UTs later in the sub-JIRAs. An updated design doc will come later as well.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12597454/YARN-291-CoreAndAdmin.patch
          against trunk revision .

          -1 patch. Trunk compilation may be broken.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1695//console

          This message is automatically generated.

          Luke Lu added a comment -

          It'll be better if the node resource config is grouped into a repeated field, so that you can configure multiple nodes in one RPC, which is the common case for cluster management systems.

          Junping Du added a comment -

          Hi Luke Lu, thanks for the suggestions. Yes, that should be much more efficient compared with calling the RPC multiple times for multiple nodes. And we should define a new interface on the wire, like NodeResourceInfo, to include NodeId and Resource. Thoughts?
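          A rough sketch of that request shape (hypothetical class names; NodeId and Resource are the YARN record types, and the wire format would use a repeated protobuf field):

              import java.util.List;

              // One RPC carries new resource settings for many nodes.
              public class UpdateNodesResourceRequest {
                public static class NodeResourceInfo {   // as suggested above: NodeId + Resource
                  NodeId nodeId;
                  Resource newTotal;                     // e.g. memory MB + vcores
                }
                private List<NodeResourceInfo> nodes;    // maps to a repeated field on the wire
              }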

          Junping Du added a comment -

          Hi, I have started working on the sub-JIRAs and will address your comments there. Thanks for the review and comments!

          Cindy Li added a comment -

          Junping,

          I'm trying to find out where you handle the different cases of the over-commit timeout, and I got to this in SchedulerUtils.java:

          // TODO process resource over-commitment case (allocated containers
          // > total capacity) in different option by getting value of
          // overCommitTimeoutMillis.

          What are your thoughts on this? Are you working on it?

          In our case, we would need to set overCommitTimeoutMillis to be > 0, and the capacity (memory/vcores) to be zero for a node to be put into maintenance. What I would expect is:
          #1. The scheduler stops scheduling new application attempts to the node.
          #2. Containers on that node are cleaned up after overCommitTimeoutMillis.

          From my understanding, #1 happens once the capacity is set to 0, but what do you plan to do for #2? YARN-999 does not seem to solve #2.

          Junping Du added a comment -

          Hi Cindy, thanks for the comments. You are right that #2 is needed in both cases (graceful NM decommission and NM maintenance). Previously, YARN-999 was designed to release allocated containers immediately, so the client would need to trigger it after tracking the timeout itself. Now the timeout can be tracked on the RM side and the release triggered on the next NM-RM heartbeat after the timeout expires, which can be the new scope of YARN-999. I plan to work on it after finishing the protocol (YARN-312), the CLI (YARN-313) and graceful decommission (YARN-914). If you think #2 is blocking your current NM maintenance work, please feel free to take YARN-999 and move it forward. I will help with review, testing, bug fixes, etc.

          Cindy Li added a comment -

          Junping, I just saw your comments on YARN-999. I can help with it.
          Can you help me understand the use cases/scope of YARN-999 besides graceful decommission? In the code below:

          // TODO process resource over-commitment case (allocated containers
          // > total capacity) in different option by getting value of
          // overCommitTimeoutMillis.

          By the different options above, do you mean overCommitTimeoutMillis > 0, = 0, and < 0? I want to find out more use cases associated with this setting besides graceful decommission. For example, you mentioned preemption of long-running tasks in YARN-999; is that part of, or a different use case from, graceful decommission?

          Also, about the August patch CoreAndAdmin.patch (in YARN-291): can you let us know your plan for it? It seems useful for driving graceful decommission from outside the YARN code.

          Thanks,

          Junping Du added a comment -

          > Junping, I just saw your comments on YARN-999. I can help with it.
          Thanks! I plan to finish the option without a timeout in December, so it would be great if you could help with the timeout part.
          > By the different options above, do you mean overCommitTimeoutMillis > 0, = 0, and < 0? I want to find out more use cases associated with this setting besides graceful decommission. For example, you mentioned preemption of long-running tasks in YARN-999; is that part of, or a different use case from, graceful decommission?
          Yes, the overCommitTimeoutMillis value selects between those options. A value < 0 (or just -1) means we tolerate tasks running to the end even when resources are over-consumed; >= 0 means we only tolerate over-commitment for the time specified in overCommitTimeoutMillis. Once it times out, we take more aggressive measures (i.e. preempting assigned containers by freezing or killing tasks) to reclaim resources so that the NM's resources are balanced again. Graceful decommission is just a special case of this, where we always set the NM's totalResource to 0 first, so all assigned containers get released after the timeout (unless the timeout is -1). If we set a proper timeout value here, the NM gets a chance to finish its running tasks and have their intermediate map output retrieved before it is decommissioned, which is why we call it "graceful".
          > Also, about the August patch CoreAndAdmin.patch (in YARN-291): can you let us know your plan for it? It seems useful for driving graceful decommission from outside the YARN code.
          Most of the patches are on track: YARN-311 (core changes) is checked in, and YARN-312 (RPC) has been reviewed with a +1. They will be there soon.
          Cheers,
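
          To make the timeout semantics above easier to follow, here is a minimal Java sketch. OverCommitPolicy and its parameters are illustrative only and do not reproduce the actual scheduler code referenced by the TODO in SchedulerUtils.java: a negative overCommitTimeoutMillis tolerates over-commitment indefinitely, while a non-negative value makes allocated containers eligible for reclamation once the deadline passes.

          import org.apache.hadoop.yarn.api.records.Resource;

          /** Illustrative over-commit timeout check, not the actual SchedulerUtils logic. */
          class OverCommitPolicy {

            /**
             * Decide whether allocated containers on a node should be reclaimed.
             *
             * @param total the node's (possibly reduced) total capacity
             * @param used resources held by currently allocated containers
             * @param overCommitTimeoutMillis negative: never reclaim; >= 0: reclaim after timeout
             * @param overCommittedSinceMillis wall-clock time when over-commitment was first seen
             */
            static boolean shouldReclaim(Resource total, Resource used,
                long overCommitTimeoutMillis, long overCommittedSinceMillis) {
              boolean overCommitted = used.getMemory() > total.getMemory()
                  || used.getVirtualCores() > total.getVirtualCores();
              if (!overCommitted || overCommitTimeoutMillis < 0) {
                // -1 (or any negative value): let running containers finish.
                return false;
              }
              long elapsed = System.currentTimeMillis() - overCommittedSinceMillis;
              // Graceful decommission: total was set to 0, so after the timeout
              // every remaining container becomes a candidate for release.
              return elapsed >= overCommitTimeoutMillis;
            }
          }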


            People

            • Assignee:
              Junping Du
            • Reporter:
              Junping Du
            • Votes:
              6
            • Watchers:
              57
