Hadoop YARN / YARN-1197

Support changing resources of an allocated container

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0-beta
    • Fix Version/s: None
    • Labels: None

      Description

      The current YARN resource management logic assumes that the resources allocated to a container are fixed for its lifetime. When users want to change the
      resources of an allocated container, the only way is to release it and allocate a new container of the expected size.
      Allowing run-time changes to the resources of an allocated container gives applications better control over their resource usage.
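
      For illustration only, a minimal sketch of how an ApplicationMaster might use such a capability; the requestContainerResourceChange() call is a placeholder for whatever interface this JIRA eventually settles on, not an existing AMRMClient method.

      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.Resource;
      import org.apache.hadoop.yarn.client.api.AMRMClient;

      // Hypothetical usage sketch: grow a running container to 4 GB / 2 vcores
      // instead of releasing it and requesting a brand-new container of that size.
      public class ContainerResizeSketch {
        public static void resize(AMRMClient<AMRMClient.ContainerRequest> rmClient,
                                  Container running) {
          Resource target = Resource.newInstance(4096, 2); // 4096 MB, 2 vcores

          // Placeholder call, not part of AMRMClient at the time of this JIRA:
          // rmClient.requestContainerResourceChange(running, target);
          // The approved change (an updated Container/token) would come back on a
          // later allocate() response, as discussed in the comments below.
        }
      }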

      1. YARN-1197 old-design-docs-patches-for-reference.zip
        813 kB
        Wangda Tan
      2. YARN-1197_Design.pdf
        692 kB
        MENG DING
      3. YARN-1197_Design.2015.07.07.pdf
        700 kB
        MENG DING
      4. YARN-1197_Design.2015.06.24.pdf
        685 kB
        MENG DING

        Issue Links

          Activity

          MENG DING made changes -
          Attachment YARN-1197_Design.2015.07.07.pdf [ 12744085 ]
          MENG DING added a comment -

          The design doc has been updated to include a Design Choices section, which summarizes the rationales behind major design decisions.

          Patches to NodeManager side of the implementation have been posted for review, which include:
          YARN-1449: AM-NM protocol changes to support container resizing
          YARN-1645: ContainerManager implementation to support container resizing
          YARN-3867: ContainerImpl changes to support container resizing
          YARN-1643: ContainersMonitor changes to support monitoring/enforcing container resizing
          YARN-1644: RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
          YARN-3868: ContainerManager recovery for container resizing

          In addition, YARN-3866 has been posted for review to unblock RM/Scheduler changes in YARN-1651 and YARN-1646.

          MENG DING added a comment -

          Released containers which are in the RUNNING state are put in NodeHeartbeatResponse.containersToCleanup and sent to the NM through the heartbeat response. After the NM receives the list, it forcefully kills these containers. I don't see any logic in the code right now to acknowledge released containers from the NM back to the RM, though. In reality, I guess most containers being released by the AM will be in the ACQUIRED state, not the RUNNING state.

          MENG DING added a comment -

          Bikas Saha, I will update the design doc with detailed intuition/rationale behind all the design choices based on the discussion in this thread.

          Bikas Saha added a comment -

          There has been a lot of discussion that looks like it's converging. It would be helpful for the other interested (but not deeply involved) people if there were an updated design document with details about the agreed-upon design. Also, if this document could outline some of the intuition/logic behind the design choices (like going through the AM for low latency), that would be super useful. Thanks!

          Wangda Tan added a comment -

          I think we can handle container decreasing similar to existing AM releasing container for now: NM network failure while decreasing is more like a corner case to me, we can add response if it is necessary. And we also need to see if AM releasing container needs similar acknowledgement from NM.

          Lei Guo added a comment -

          Agreed, this is similar to the other cases you mentioned. In this case, we may need to recommend that the AM implementation check/confirm the decrease status after the request.

          MENG DING added a comment -

          One option to consider is to let the NM confirm back to the RM when it is done decreasing the container size. If the RM doesn't receive confirmation from the NM, it will keep sending the decrease message to the NM during heartbeat. This is only for the purpose of resource enforcement. From a scheduling point of view, as soon as the decrease request is approved in the RM, it takes effect immediately.

          I am not sure if this is worth the effort.
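
          For what it's worth, a minimal sketch of the resend-until-acknowledged bookkeeping described above; the class and method names here are purely illustrative and do not correspond to actual RM code.

          import java.util.Map;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          import org.apache.hadoop.yarn.api.records.ContainerId;
          import org.apache.hadoop.yarn.api.records.Resource;

          // Hypothetical RM-side bookkeeping for "keep sending the decrease until the NM confirms".
          class PendingDecreaseTracker {
            private final Map<ContainerId, Resource> pendingDecreases = new ConcurrentHashMap<>();

            // Scheduler approved a decrease: it takes effect for scheduling right away,
            // but is remembered until the NM acknowledges enforcement.
            void onDecreaseApproved(ContainerId id, Resource target) {
              pendingDecreases.put(id, target);
            }

            // On every node heartbeat, piggyback all still-unacknowledged decreases.
            Map<ContainerId, Resource> decreasesForHeartbeat() {
              return pendingDecreases;
            }

            // The NM reported which containers it has already shrunk; stop resending those.
            void onNodeAck(Set<ContainerId> acknowledged) {
              pendingDecreases.keySet().removeAll(acknowledged);
            }
          }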

          MENG DING added a comment -

          Lei Guo, the NM will persist the resource decrease in LevelDB when it receives the decrease message, so if it fails and is restarted, it can recover the correct container size. In the case of a network failure, the decrease message will be lost, but this is the same as with all other messages in the response (e.g., containers to clean up/remove). In practice, I don't think this is a serious problem, as we assume that by the time a user issues the resource decrease request for a container, that container should have already given up that amount of resource.

          Let me know if you have any thoughts or ideas.
          Thanks.

          Lei Guo added a comment -

          MENG DING, for the decrease flow via NodeHeartbeatResponseProto to notify the NM, how do we handle the case of a network/NM failure?

          MENG DING made changes -
          Attachment YARN-1197_Design.2015.06.24.pdf [ 12741738 ]
          MENG DING added a comment -

          The design doc has been updated based on the latest discussion and feedback from all parties. I have added a section for NodeManager recovery as well. Some details still need to be ironed out, which will be covered in sub-tasks. Special thanks to Wangda Tan for all the valuable suggestions offline.

          Sandy Ryza added a comment -

          The latest proposal makes sense to me as well. Thanks Tan, Wangda and MENG DING!

          Wangda Tan added a comment -

          MENG DING,
          The latest proposal (https://issues.apache.org/jira/browse/YARN-1197?focusedCommentId=14588963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14588963) makes sense to me.

          We definitely need AllocateResponseProto for container increase token. For decrease result, it is optional, but probably it doesn't hurt to set it anyway.

          I suggest we only include the token when it's necessary; we can add a token to the decrease result when needed.

          We could just use the existing getContainerStatus() API for doing this polling for now.

          +1, we don't need a new API.

          Sandy Ryza, do you agree with the latest proposal?

          MENG DING added a comment -

          Sorry got things messed up.

          Correction:

          We definitely need AllocateResponseProto for container increase token. For decrease result, it is optional, but probably it doesn't hurt to set it anyway.

          Vinod Kumar Vavilapalli added a comment -

          I don't think it's possible for the AM to start using the additional allocation till the NM has updated all its state - including writing out recovery information for work-preserving restart (Thanks Vinod for pointing this out). Seems like that poll/callback will be required - unless the plan is to route this information via the RM.

          We could just use the existing getContainerStatus() API for doing this polling for now.

          MENG DING added a comment -

          Wangda Tan, I am certainly OK doing (a). My original frustration was mainly about inconsistency in the RM when doing decrease through the NM; now that we have all agreed that decrease should go through the RM, the problem is gone.

          So here is the latest proposal:

          • Container resource decrease:
            AM -> RM -> NM
          • Container resource increase:
            AM -> RM -> AM(token) -> NM. AM needs to poll status of container before using the additional allocation.
            Of course we need to properly handle token expiration (i.e., NM -> RM communication is needed to unregister the container from the expirer).

          In addition, I do not see a need for any response to be set in the AllocateResponseProto:

          • For resource decrease, we can assume it is always successful.
          • For resource increase, we are now doing polling to see if the increase is successful.

          Let me know if this makes sense.
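
          To make the increase path concrete, here is a rough AM-side sketch; the increaseContainerResource() call and the capability check on the polled status are assumptions about the eventual API, not existing methods.

          import org.apache.hadoop.yarn.api.records.Container;
          import org.apache.hadoop.yarn.api.records.ContainerStatus;
          import org.apache.hadoop.yarn.api.records.Resource;
          import org.apache.hadoop.yarn.client.api.NMClient;

          // Hypothetical AM-side flow for the proposed increase path:
          // AM -> RM (request) -> AM (updated token) -> NM (apply), then AM polls before use.
          class IncreaseFlowSketch {
            static void applyIncrease(NMClient nmClient, Container increased, Resource target)
                throws Exception {
              // The allocate() response is assumed to have carried back 'increased', a
              // Container whose token now authorizes the larger size.

              // Hand the token to the NM; placeholder call, not an existing NMClient method:
              // nmClient.increaseContainerResource(increased);

              // Poll the existing getContainerStatus() API until the NM reflects the new size,
              // before the AM schedules work that relies on the extra resources.
              for (int attempt = 0; attempt < 50; attempt++) {
                ContainerStatus status =
                    nmClient.getContainerStatus(increased.getId(), increased.getNodeId());
                // Assumed: the status exposes the enforced capability once the NM is done.
                // if (target.equals(status.getCapability())) { return; }
                Thread.sleep(200);
              }
            }
          }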

          Wangda Tan added a comment -

          Thanks for comment, Siddharth Seth/Sandy Ryza.

          Now I'm convinced, from the two downstream developers' view: +1 to do the AM-RM-AM-NM flow (a) for increase, as in the original doc, before (b). I'm not sure if (b) is really required; we can do (b) if there are any real use cases.

          More broadly, just because YARN is not good at hitting sub-second latencies doesn't mean that it isn't a design goal. I strongly oppose any argument that uses the current slowness of YARN as a justification for why we should make architectural decisions that could compromise latencies.

          Make sense to me.

          I.e. that an AM can receive an increase from the RM, then issue a decrease to the NM, and then use its increase to get resources it doesn't deserve?

          Yes, if we send the increase request to the RM but send the decrease request to the NM, we need to handle complex inconsistencies on the RM side. You can take a look at the latest design doc for more details.

          I don't think it's possible for the AM to start using the additional allocation till the NM has updated all its state - including writing out recovery information for work-preserving restart (Thanks Vinod for pointing this out). Seems like that poll/callback will be required - unless the plan is to route this information via the RM.

          Maybe we need to wait for all increase steps (monitor/cgroup/state-store) to finish before using the additional allocation. If a container is 5G and is increased to 10G, but the RM/NM crashes before writing to the state store while the app starts using 10G, then after RM restart/recovery the NM/RM will think the container is 5G, which will be problematic.

          MENG DING, do you agree with doing (a)?

          Siddharth Seth added a comment -

          I would argue that waiting for an NM-RM heartbeat is much worse than waiting for an AM-RM heartbeat. With continuous scheduling, the RM can make decisions in millisecond time, and the AM can regulate its heartbeats according to the application's needs to get fast responses. If an NM-RM heartbeat is involved, the application is at the mercy of the cluster settings, which should be in the multi-second range for large clusters.

          I tend to agree with Sandy's arguments about option a being better in terms of latency - and that we shouldn't be architecting this in a manner which would limit it to the seconds range rather than milliseconds / hundreds of milliseconds when possible.

          It's already possible to get fast allocations - low 100s of milliseconds via a scheduler loop which is delinked from NM heartbeats and a variable AM-RM heartbeat interval, which is under user control rather than being a cluster property.

          There are going to be improvements to the performance of various protocols in YARN. HADOOP-11552 opens up one such option, which allows AMs to know about allocations as soon as the scheduler has made the decision, without a requirement to poll. Of course, there's plenty of work to be done before that can actually be used.

          That said, callbacks on the RPC can be applied at various levels - including NM-RM communication, which can make option b work fast as well. However, it will incur the cost of additional RPC roundtrips. Option a, however, can be fast from the get go with tuning, and also gets better with future enhancements.

          I don't think it's possible for the AM to start using the additional allocation till the NM has updated all its state - including writing out recovery information for work-preserving restart (Thanks Vinod for pointing this out). Seems like that poll/callback will be required - unless the plan is to route this information via the RM.

          MENG DING added a comment -

          Wangda Tan, if I understand it correctly, in the AllocateResponseProto we will have something like containers_change_approved and containers_change_completed. The former will be filled with the ID/capability of containers whose change requests have been approved by the RM. The latter will be filled with the ID/capability of containers whose resource change action has been completed in the NM. Right?

          MENG DING added a comment -

          Sandy Ryza, by processing both resource decrease and increase requests through the RM, the original problem that I brought up should not be an issue any more.
          What we are trying to grasp right now is whether it is really necessary for the increase action to go through RM->AM->NM. IMHO, if we can eliminate the need for that while still achieving reasonable performance, that would be ideal.

          Wangda Tan added a comment -

          Thanks MENG DING,

          I think (c) sounds like a very good proposal; it has these advantages:

          • Latency is better than (a) (if we assume network conditions between AM-RM and RM-NM are the same, since the RM sends the response to the NM at the same heartbeat).
          • It doesn't expose the container token, etc. to the AM when the increase is approved, which is not necessary; the AM only needs to poll the NM about the status of the resource change.
          • It can be considered an additional step on top of (b) ((c) = (b) + rm_response_to_am_when_increase_approved + am_poll_nm_about_increase_status). Good for planning as well.

          We can have option (b) enabled by default, and use a configuration parameter to turn on option (c) for frameworks like Spark.

          I think the two can be enabled together; I don't see any conflict between them. The AM can poll the NM if it doesn't want to wait another NM-RM heartbeat.

          Thoughts? Sandy Ryza, Vinod Kumar Vavilapalli.

          Sandy Ryza added a comment -

          I think this assumes the cluster is quite idle; I understand the low latency could be achieved, but it's not guaranteed since we don't support oversubscribing, etc.

          If the cluster is fully contended we certainly won't get this performance. But as long as there is a decent chunk of space, which is common in many settings, we can. The cluster doesn't need to be fully idle by any means.

          More broadly, just because YARN is not good at hitting sub-second latencies doesn't mean that it isn't a design goal. I strongly oppose any argument that uses the current slowness of YARN as a justification for why we should make architectural decisions that could compromise latencies.

          That said, I still don't have a strong grasp on the kind of complexity we're introducing in the AM, so would like to try to understand that before arguing against you further.

          Is the main problem we're grappling still the one Meng brought up here:
          https://issues.apache.org/jira/browse/YARN-1197?focusedCommentId=14556803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14556803?
          I.e. that an AM can receive an increase from the RM, then issue a decrease to the NM, and then use its increase to get resources it doesn't deserve?

          Or is the idea that, even if we didn't have this JIRA, NMClient is too complicated, and we'd like to reduce that?

          MENG DING added a comment -

          Thanks guys for all the comments! I think we all agreed that the container decrease request should go through the RM, and the decrease action will be triggered with the RM->NM heartbeat.

          For the increase request and action, theoretically option (a) will have better performance, but we are incurring extra complexity for both YARN and application writers. I was wondering if we can consider option (c), which sort of meets (a) and (b) in the middle:

          1) AM sends increase request to RM
          2) RM allocates the resource and sends the increase token to NM.
          3) RM sends response to AM right away, instead of waiting for NM to confirm that the increase action has been completed.
          4) Upon receiving the response (which indicates that the increase has been triggered), AM should first poll the container status to make sure that the increase is done before taking action to allocate new tasks.

          Option (c) will save one NM-RM heartbeat cycle, and since both option (a) and (c) need to poll container status, their performance will be very close.

          We can have option (b) enabled by default, and use a configuration parameter to turn on option (c) for frameworks like Spark.

          Do you think this is worth considering?
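
          As a rough illustration of what step 2) implies on the NM side (the option (b)/(c) path), here is a hypothetical sketch; the heartbeat-response handling, state-store call, and monitor hook are all made-up names, not NodeManager code.

          import java.util.List;

          import org.apache.hadoop.yarn.api.records.ContainerId;
          import org.apache.hadoop.yarn.api.records.Resource;

          // Hypothetical NM-side handling for options (b)/(c): approved increases arrive on
          // the RM-NM heartbeat response and the NM enforces them locally.
          class NodeIncreaseSketch {
            // Stand-ins for the NM state store and the containers monitor.
            interface StateStore { void storeContainerResource(ContainerId id, Resource r); }
            interface Monitor { void setResourceLimit(ContainerId id, Resource r); }

            static final class ContainerToIncrease {
              final ContainerId id;
              final Resource target;
              ContainerToIncrease(ContainerId id, Resource target) { this.id = id; this.target = target; }
            }

            private final StateStore stateStore;
            private final Monitor monitor;

            NodeIncreaseSketch(StateStore stateStore, Monitor monitor) {
              this.stateStore = stateStore;
              this.monitor = monitor;
            }

            // Called when a heartbeat response carries containers to increase.
            void onHeartbeatResponse(List<ContainerToIncrease> toIncrease) {
              for (ContainerToIncrease c : toIncrease) {
                // Persist first so a restarted NM recovers the new size (work-preserving restart).
                stateStore.storeContainerResource(c.id, c.target);
                // Then update monitoring/cgroups enforcement to the new limit.
                monitor.setResourceLimit(c.id, c.target);
                // On the next heartbeat the NM reports these back, so the RM (and, in option (c),
                // a polling AM) can see that the increase is done.
              }
            }
          }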

          Wangda Tan added a comment -

          To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning.

          I think this assumes the cluster is quite idle; I understand the low latency could be achieved, but it's not guaranteed since we don't support oversubscribing, etc. If you assume the cluster is very idle, one solution might be holding more resources at the beginning instead of increasing. In a real environment, I think the expected delay is still at the seconds level.

          From YARN's perspective, (b) handles most of the logic within the YARN daemons (instead of the AM), and we don't need to consider inconsistent status between RM/AM when doing recovery; that is really what I prefer. I'm not against doing (a), but I prefer to do that when we have a solid foundation for fast scheduling. I'm not sure if any resource management platform in production supports that; some research papers such as Sparrow use a quite different protocol/approach than YARN. I expect there are still some TODO items for YARN to get guaranteed fast scheduling.

          Sandy Ryza added a comment -

          Regarding complexity in the AM, the NMClient utility so far has been an API that's fairly easy for app developers to interact with. I've used it more than once and had no issues. Would we not be able to handle most of the additional complexity behind it?

          Sandy Ryza added a comment -

          If you consider all current/future optimizations, such as continuous scheduling / the scheduler making a decision at the same AM-RM heartbeat, (b) needs one more NM-RM heartbeat interval. I agree with you, it could be hundreds of milliseconds for (a) vs. multiple seconds for (b) when the cluster is idle.

          To clarify: with proper tuning, we can currently get low hundreds of milliseconds without adding any new scheduler features. With the new scheduler feature I'm imagining, we'd only be limited by the RPC + scheduler time, so we could get 10s of milliseconds with proper tuning.

          Wangda Tan added a comment -

          Sandy Ryza,
          Thanks for replying,

          Why does the AM need to poll the NM about increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container?

          Yes, the NM only needs to update its tracking of the resource and cgroups. We cannot assume this happens immediately, so we cannot put "container increased" into the same RPC. This is the same as startContainer: even if launching a container is fast in most cases, the AM needs to poll the NM after invoking startContainer.

          Would option (b) ever be able to achieve this kind of latency?

          If you consider all current/future optimizations, such as continuous scheduling / the scheduler making a decision at the same AM-RM heartbeat, (b) needs one more NM-RM heartbeat interval. I agree with you, it could be hundreds of milliseconds for (a) vs. multiple seconds for (b) when the cluster is idle.

          But I'm wondering, do we really need to add this complexity to the AM before we have the mature optimizations listed above? Also, if the cluster is busier, we cannot expect that low delay anyway. I tend toward doing (b) now since it's simpler for app developers to use this feature; I'm open to adding the AM->NM channel once the YARN scheduler supports fast scheduling better.

          Sandy Ryza added a comment -

          Option (a) can occur in the low hundreds of milliseconds if the cluster is tuned properly, independent of cluster size.
          1) Submit increase request to RM. Poll RM 100 milliseconds later after continuous scheduling thread has run in order to pick up the increase token.
          2) Send increase token to NM.

          Why does the AM need to poll the NM about increase status before taking action? Does the NM need to do anything other than update its tracking of the resources allotted to the container?

          Also, it's not unlikely that schedulers will be improved to return the increase token on the same heartbeat that it's requested. So this could all happen in 2 RPCs + a scheduler decision, and no additional wait time. Anything more than this is probably prohibitively expensive for a framework like Spark to submit an increase request before running each task.

          Would option (b) ever be able to achieve this kind of latency?

          Wangda Tan added a comment -

          Sandy Ryza,
          I think increasing via AM<->NM and via RM<->NM are in a very similar range of delay (multiple seconds for now).

          a. AM<->NM needs 3 stages
          1) AM Get increase token from RM
          2) AM send increase token to NM
          3) Polling NM about increase status (because we cannot assume the increase can be done on the NM side very fast)

          b. RM->NM needs 4 stages
          1) RM send back increasing token to NM
          2) NM doing increase locally
          3) NM report back to RM when increasing done
          4) RM send increase done to AM

          Solution b. has an additional RM->NM heartbeat interval

          Benefits of b. (Some of them also mentioned by Meng)

          • Simpler for the AM: it only needs to know when the increase is done, and doesn't need to receive a token and submit/poll the NM.
          • Creates a consistent way for applications to increase/decrease containers.
          • Recovery is simpler: the AM only knows about the increase when it's finished, so we only need to handle recovery of 2 components (NM/RM) instead of 3 (NM/RM/AM).

          Before we have a fast scheduling design/plan (I don't think we can support millisecond-level scheduling for now; too-frequent AM heartbeating will overload the RM), I don't think adding an additional NM->RM heartbeat interval is a big problem.

          Sandy Ryza added a comment -

          RM still needs to wait for an acknowledgement from NM to confirm that the increase is done before sending out the response to AM. This will take two heartbeat cycles, but this is not much worse than giving out a token to AM first, and then letting the AM initiate the increase.

          I would argue that waiting for an NM-RM heartbeat is much worse than waiting for an AM-RM heartbeat. With continuous scheduling, the RM can make decisions in millisecond time, and the AM can regulate its heartbeats according to the application's needs to get fast responses. If an NM-RM heartbeat is involved, the application is at the mercy of the cluster settings, which should be in the multi-second range for large clusters.

          Sandy Ryza added a comment -

          Is my understanding correct that the broader plan is to move stopping containers out of the AM-NM protocol?

          Vinod Kumar Vavilapalli added a comment -

          The details look good.

          Let's make sure we handle RM, AM and NM restarts correctly. Also, let's design the RM - NM protocol to be generic and common enough for regular launch/stop and increase/decrease.

          Tx again for driving this!

          MENG DING added a comment -

          Sandy Ryza, Yes. The key assumption is that by the time the Application Master requests resource decrease from RM for a particular container, that container should have already reduced its resource usage. Therefore, RM can immediately allocate resource to others.

          So to summarize the main idea:

          • Both container resource increase and decrease requests go through RM. This eliminates the race condition where, while a container increase is in progress, a decrease for the same container takes place.
          • There is no need for the AM-NM protocol anymore. This greatly simplifies the logic for application writers.
          • Resource decrease can happen immediately in RM, and the actual enforcement/monitoring of the decrease can happen offline, as mentioned by Vinod.
          • Resource increase, on the other hand, needs more thought.
            • In the current design, the RM gives out an increase token to be used by AM to initiate the increase on NM. There is no need for this. RM can notify the increase to NM through RM-NM heartbeat response.
            • RM still needs to wait for an acknowledgement from NM to confirm that the increase is done before sending out the response to AM. This will take two heartbeat cycles, but this is not much worse than giving out a token to AM first, and then letting the AM initiate the increase.
            • Since RM needs to wait for acknowledgement from NM to confirm the increase, we must handle such cases as timeout, NM restart/recovery, etc. So we probably still need to have a container increase token, and token expiration logic for this purpose, but the token will be sent to NM through RM-NM heartbeat protocol. (I am still working out the details)
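
          As a rough sketch of the expiration idea in the last bullet, the snippet below tracks granted-but-unconfirmed increases on the RM side; every name is hypothetical, and a real implementation would hang off the existing container-token expirer rather than a standalone class.

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          import org.apache.hadoop.yarn.api.records.ContainerId;
          import org.apache.hadoop.yarn.api.records.Resource;

          // Hypothetical RM-side tracking of granted-but-unconfirmed increases: if the NM never
          // confirms within the expiry interval, the extra allocation would be rolled back.
          class PendingIncreaseExpirer {
            private static final long EXPIRY_MS = 10 * 60 * 1000L;

            private static final class Pending {
              final Resource grantedDelta;
              final long grantedAtMs;
              Pending(Resource grantedDelta, long grantedAtMs) {
                this.grantedDelta = grantedDelta;
                this.grantedAtMs = grantedAtMs;
              }
            }

            private final Map<ContainerId, Pending> pending = new ConcurrentHashMap<>();

            // Scheduler granted an increase; remember it until the NM confirms enforcement.
            void onIncreaseGranted(ContainerId id, Resource delta) {
              pending.put(id, new Pending(delta, System.currentTimeMillis()));
            }

            // NM confirmed the increase via heartbeat: stop tracking it.
            void onIncreaseConfirmed(ContainerId id) {
              pending.remove(id);
            }

            // Periodically called: drop grants the NM never confirmed so the scheduler can reclaim them.
            void expire(long nowMs) {
              pending.entrySet().removeIf(e -> nowMs - e.getValue().grantedAtMs > EXPIRY_MS);
            }
          }
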
          Sandy Ryza added a comment -

          Going through RM directly is better as the RM will immediately know that the resource is available for future allocations

          Is the idea that the RM would make allocations using the space before receiving acknowledgement from the NodeManager that it has resized the container (adjusted cgroups)?

          Vinod Kumar Vavilapalli added a comment -

          We all agreed that due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through Resource Manager

          +1 for this idea. Letting this go through the NodeManager directly adds too much complexity and difficult-to-understand semantics for application writers.

          If I understand correctly, this would be a considerable hit to performance

          Sandy Ryza, as I understand, going through NM is in fact a worse solution w.r.t allocation throughput. Going through RM directly is better as the RM will immediately know that the resource is available for future allocations - the decrease on the NM can happen offline. The control flow I expect is

          • the framework/app decides it doesn't need that many resources anymore. By this time, the container already should have given up on the physical resources it doesn't need
          • informs the RM about the required decrement
          • RM informs NM to resize the container (cgroups etc)
          Wangda Tan added a comment -

          Sparking->Spark

          Wangda Tan added a comment -

          Sandy Ryza,
          Thanks for coming back .
          I'm not very sure about what's the performance issue you mentioned if decreases goes to RM, what's the expected (ideal) delay in your mind of Sparking releasing resource.

          Sandy Ryza added a comment -

          Sorry, I've been quiet here for a while, but I'd be concerned about a design that requires going through the ResourceManager for decreases. If I understand correctly, this would be a considerable hit to performance, which could be prohibitive for frameworks like Spark that might use container resizing for allocating per-task resources.

          MENG DING added a comment -

          Had a very good discussion with Wangda Tan at the Hadoop summit. We all agreed that, due to the complexity of the current design, it is worthwhile to revisit the idea of increasing and decreasing container size both through the Resource Manager; that would at least eliminate the need for token expiration logic, and also eliminate the need for the AM-NM protocol and APIs. I am currently working on the new design, and will post it for review when it is ready.

          Wangda Tan added a comment -

          MENG DING, I just added you to the contributor list; you can go ahead and assign JIRAs to yourself.

          MENG DING added a comment -

          Just an update, I am currently working on:

          YARN-1449, API in NM side to support change container resource
          YARN-1643, ContainerMonitor changes in NM
          YARN-1510, NMClient

          I will append patches and drive discussions in each ticket.

          MENG DING added a comment -

          Just wanted to add that if dominant resource calculator is being used, it may compare different dimensions between target and current resource, but since we have the restriction that all dimensions must be >= or <= for increase/decrease actions, there should be no conflicting results.

          Wangda Tan made changes -
          Wangda Tan added a comment -

          Moved the old design docs / patches into a zip file to avoid unnecessary noise.

          Wangda Tan made changes -
          Attachment yarn-server-resourcemanager.patch.ver.1 [ 12615533 ]
          Wangda Tan made changes -
          Attachment yarn-server-nodemanager.patch.ver.1 [ 12615532 ]
          Wangda Tan made changes -
          Attachment yarn-server-common.patch.ver.1 [ 12615531 ]
          Wangda Tan made changes -
          Attachment yarn-pb-impl.patch.ver.1 [ 12615530 ]
          Wangda Tan made changes -
          Attachment yarn-api-protocol.patch.ver.1 [ 12615529 ]
          Wangda Tan made changes -
          Attachment yarn-1197-v5.pdf [ 12617834 ]
          Wangda Tan made changes -
          Attachment yarn-1197-v4.pdf [ 12614261 ]
          Wangda Tan made changes -
          Attachment yarn-1197-v3.pdf [ 12607745 ]
          Wangda Tan made changes -
          Attachment yarn-1197-v2.pdf [ 12606770 ]
          Wangda Tan made changes -
          Attachment yarn-1197-scheduler-v1.pdf [ 12618425 ]
          Wangda Tan made changes -
          Attachment yarn-1197.pdf [ 12604489 ]
          Wangda Tan made changes -
          Attachment tools-project.patch.ver.1 [ 12615528 ]
          Wangda Tan made changes -
          Attachment mapreduce-project.patch.ver.1 [ 12615527 ]
          MENG DING added a comment -

          Wangda Tan
          Makes sense to me. Will update the doc to include this.

          Wangda Tan added a comment -

          MENG DING,
          For the comparison of resources, I think for both increase/decrease it should be >= or <= across all dimensions. But if the default resource calculator is used, increasing v-cores makes no sense. So I think the ResourceCalculator has to be used, but we also need to check all individual dimensions.

          So the logic will be:

          if (increase):
              delta = target - now
              // no individual dimension may shrink
              if (delta.mem < 0 || delta.vcore < 0):
                 throw exception
              // and the delta must be a real increase overall
              if (resourceCalculator.lessOrEqualThan(delta, 0)):
                 throw exception
              // ... move forward
          
          MENG DING added a comment -

          Correcting a typo in my previous post; it should be:

          As an example, if a container is currently using 2G, and AM asks to increase its resource to 4G, and then asks again to increase to 6G, but AM doesn't actually use any of the tokens to increase the resource on NM. In this case, with the current design, RM can only revert the resource allocation back to 4G after expiration, not 2G.

          Forgot to discuss another important piece. We probably should not use the existing ResourceCalculator to compare two resource capabilities in this project, because:

          • The DefaultResourceCalculator only compares memory, which won't work if we want to only change CPU cores.
          • The DominantResourceCalculator may end up comparing different dimensions between two Resources, which doesn't make sense in our project.

          The way to compare two resources in this project should be straightforward, as follows (a small sketch is given after the list). Let me know if you think otherwise.

          • For increase request, no dimension in the target resource can be smaller than the corresponding dimension in the current resource, and at least one dimension in the target resource must be larger than the corresponding dimension in the current resource.
          • For decrease request, no dimension in the target resource can be larger than the corresponding dimension in the current resource, and at least one dimension in the target resource must be smaller than the corresponding dimension in the current resource.
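
          A minimal sketch of this comparison rule over the memory and vcores dimensions (the method names below are illustrative only, not part of any posted patch):

          // Valid increase: no dimension shrinks, and at least one dimension grows.
          static boolean isValidIncrease(Resource current, Resource target) {
            boolean noneSmaller = target.getMemory() >= current.getMemory()
                && target.getVirtualCores() >= current.getVirtualCores();
            boolean someLarger = target.getMemory() > current.getMemory()
                || target.getVirtualCores() > current.getVirtualCores();
            return noneSmaller && someLarger;
          }

          // Valid decrease: the mirror image of the increase check.
          static boolean isValidDecrease(Resource current, Resource target) {
            return isValidIncrease(target, current);
          }
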
          MENG DING added a comment -

          Thanks Vinod Kumar Vavilapalli and Wangda Tan for the great comments!

          To Vinod Kumar Vavilapalli:

          Expanding containers at ACQUIRED state sounds useful in theory. But agree with you that we can punt it for later.

          Thanks for the confirmation

          To your example of concurrent increase/decrease sizing requests from AM, shall we simply say that only one change-in-progress is allowed for any given container?

          Actually we really wanted to be able to achieve this, but with the current asymmetric logic of increasing resources through RM and decreasing resources through NM, it doesn't seem to be possible. The reasons are:

          • The increase action starts from AM requesting the increase from RM, being granted a resource increase token, then initiating the increase action on NM, until finally NM confirming with RM about the increase.
          • Once an increase token has been granted to AM, and before it expires (10 minutes by default), if AM does not initiate the increase action on NM, NM will have no idea that an increase is already in progress.
          • If, at this moment, AM initiates a resource decrease action on NM, NM will go ahead and honor it. So in effect, there can be concurrent decrease/increase action going on, and there doesn't seem to be a way to block this.

          If we do the above, this will also simplify most of the code, as we will simply have the notion of a Change, instead of an explicit increase/decrease everywhere. For e.g., we will just have a ContainerResourceChangeExpirer.

          I believe the ContainerResourceChangeExpirer only applies to the container resource increase action. The container decrease action goes directly through NM so it does not need an expiration logic.

          There will be races with container-states toggling from RUNNING to finished states, depending on when AM requests a size-change and when NMs report that a container finished. We can simply say that the state at the ResourceManager wins.

          Agreed.

          Didn't understand why we need this RM-NM confirmation. The token from RM to AM to NM should be enough for NM to update its view, right?

          This is the same as the reasons listed above.

          Instead of adding new records for ContainerResourceIncrease / decrease in AllocationResponse, should we add a new field in the API record itself stating if it is a New/Increased/Decreased container? If we move to a single change model, it's likely we will not even need this.

          I am open to this suggestion. We could add a field to the existing ContainerProto to indicate whether this Container is a new/increased/decreased container. The only thing I am not sure about is whether we can still change the AllocateResponseProto now that ContainerResourceIncrease/Decrease is already in trunk.

          Any obviously invalid change-requests should be rejected right-away. For e.g, an increase to more than cluster's max container size. Seemed like you are suggesting we ignore the invalid requests.

          Agreed that any invalid increase requests from AM to RM, and invalid decrease requests from AM to NM should be directly rejected. The 'ignore' case I was referring to is in the context of NodeUpdate from NM to RM.

          Nit: In the design doc, the high-level flow for container-increase point #7 incorrectly talks about decrease instead of increase.

          Yes, this is a mistake, and I will correct it.

          I propose we do this in a branch

          Definitely. There is already a YARN-1197 branch, and we can simply work in that branch.

          To Wangda Tan:

          Actually the approach in the design doc is this (Meng, please let me know if I misunderstood). In the scheduler's implementation, only one pending change request is allowed for the same container; a later change request will either overwrite the prior one or be rejected.

          The current design only allows one outstanding increase request per container, which is guaranteed by the ContainerResourceIncreaseExpirer object. However, as explained above, we cannot block a decrease action while an increase action is still in progress.

          1) For the protocols between servers/AMs, mostly same to previous doc, the biggest change I can see is the ContainerResourceChangeProto in NodeHeartbeatResponseProto, which makes sense to me.

          Yes, the ContainerResourceChangeProto is the biggest change. Glad that you agree with this new protocol

          2) For the client side change: 2.2.1, +1 to option 3.

          Great. I will remove option 1 and option 2 from the design doc.

          3) For 2.3.3.2 scheduling part, The scheduling of an outstanding resource increase request to a container will be skipped if there are either:. Both of the two may not be needed, since the AM can ask for more resources while a container increase is in progress (e.g. the container was increased to 4G, and the AM wants it to be 6G before notifying the NM).

          Good point, this could be very convenient in practice. However the thing that I have not figured out is how to handle the increase token expiration logic if we have multiple increase actions going on at the same time. The current expiration logic (section 2.3.2 in the design doc) only tracks one increase request for a container (container ID + original capacity for rollback). As an example, if AM is currently using 2G, and asks to increase to 4G, and then asks again to increase to 6G, but AM doesn't actually use any of the token to increase the resource on NM. In this case, RM can only revert the resource allocation back to 4G after expiration, not 2G.

          4) We may not need a "reserved increase request"; all increase requests should be considered "reserved". But we still need to respect the order of applications in the LeafQueue, no matter whether it's the original FIFO or Fair (added after YARN-3306). We can discuss more scheduling details in separate JIRAs.

          For sure. My knowledge in the scheduler side is still very limited, so I will continue to learn along the way.

          By the way, thanks for cleaning up the JIRAs. It's great that you are able to work on the RM/Scheduler! I am glad to take any unassigned tasks.

          Wangda Tan added a comment -

          Short summary about past works to make sure they will not be wasted:

          • https://issues.apache.org/jira/secure/attachment/12618222/yarn-1449.5.patch contains the changes of YARN-1449 and YARN-1643. They can very likely be reused.
          • https://issues.apache.org/jira/secure/attachment/12619072/yarn-1502.2.patch contains the changes of YARN-1646 and YARN-1651. They can very likely NOT be reused.

          Wangda Tan added a comment -

          Hi MENG DING,
          I just completed a cleanup of the sub JIRAs. I think some of them were too fine-grained. For example, it will be very hard to split work in FiCaSchedulerNode/FiCaSchedulerApp from changes in LeafQueue/ParentQueue. The following are the JIRAs after cleanup:

          API:
          YARN-1449, API in NM side to support change container resource.
          YARN-1502, API changes in RM side to support change container resource.

          Client:
          YARN-1509, AMRMClient
          YARN-1510, NMClient

          NM implementation
          YARN-1643, ContainerMonitor changes in NM

          RM implementation
          YARN-1646, Support change container resource in RM.
          YARN-1651, CapacityScheduler side changes.

          I unassigned myself from many of these JIRAs, but I still plan to implement the changes on the RM/CS side. For the other JIRAs, please feel free to take them.

          Wangda Tan added a comment -

          Thanks to MENG DING for thinking this through and extending it into a thorough design doc, and to Vinod Kumar Vavilapalli for the review. I would really like to see this move forward.

          To Vinod's comment:

          Didn't understand why we need this RM-NM confirmation. The token from RM to AM to NM should be enough for NM to update its view, right?

          This is to make sure the RM/NM are synchronized; one example is mentioned in https://issues.apache.org/jira/browse/YARN-1197?focusedCommentId=14559284&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14559284. In this design, NM/RM communication is two-way, so the RM needs to acknowledge changes to the NM so that the NM can update the container monitoring status locally and avoid inconsistencies.

          To your example of concurrent increase/decrease sizing requests from AM, shall we simply say that only one change-in-progress is allowed for any given container?

          Actually the approach in the design doc is this (Meng, please let me know if I misunderstood). In the scheduler's implementation, only one pending change request is allowed for the same container; a later change request will either overwrite the prior one or be rejected.

          Some feedback on the design doc so far:
          1) For the protocols between servers/AMs, mostly same to previous doc, the biggest change I can see is the ContainerResourceChangeProto in NodeHeartbeatResponseProto, which makes sense to me.
          2) For the client side change: 2.2.1, +1 to option 3.
          3) For 2.3.3.2 scheduling part, {{The scheduling of an outstanding resource increase request to a container will be skipped if there are either:}}. Both of the two may not be needed, since the AM can ask for more resources while a container increase is in progress (e.g. the container was increased to 4G, and the AM wants it to be 6G before notifying the NM).
          4) We may not need a "reserved increase request"; all increase requests should be considered "reserved". But we still need to respect the order of applications in the LeafQueue, no matter whether it's the original FIFO or Fair (added after YARN-3306). We can discuss more scheduling details in separate JIRAs.

          I will clean up the subtasks (some of them are too detailed to me, especially the scheduler-internal changes). Will post once I finish.

          Vinod Kumar Vavilapalli added a comment -

          Tx for taking this up MENG DING!

          Read your updated doc. Looks good overall. Pretty comprehensive, great work!

          Some comments

          Major

          • Expanding containers at ACQUIRED state sounds useful in theory. But agree with you that we can punt it for later.
          • To your example of concurrent increase/decrease sizing requests from AM, shall we simply say that only one change-in-progress is allowed for any given container?
          • If we do the above, this will also simplify most of the code, as we will simply have the notion of a Change, instead of an explicit increase/decrease everywhere. For e.g., we will just have a ContainerResourceChangeExpirer.
          • There will be races with container-states toggling from RUNNING to finished states, depending on when AM requests a size-change and when NMs report that a container finished. We can simply say that the state at the ResourceManager wins.
          • After processing all resource change messages for a container in a node update, RM will set the current resource allocation known by RM for this container in the next node heartbeat response, so that NM will (eventually) have the same view of the resource allocation of this container with RM, and monitor/enforce accordingly.

            Didn't understand why we need this RM-NM confirmation. The token from RM to AM to NM should be enough for NM to update its view, right?

          Minor

          • Instead of adding new records for ContainerResourceIncrease / decrease in AllocationResponse, should we add a new field in the API record itself stating if it is a New/Increased/Decreased container? If we move to a single change model, it's likely we will not even need this.
          • Any obviously invalid change-requests should be rejected right-away. For e.g, an increase to more than cluster's max container size. Seemed like you are suggesting we ignore the invalid requests.
          • Nit: In the design doc, the high-level flow for container-increase point #7 incorrectly talks about decrease instead of increase.

          Just caught up with the rest of your discussion w.r.t. decreasing container sizes. The feature is useful outside of JVM processes - C code, servers managing their data off-heap, etc. - so we can continue working on it.

          Process

          I propose we do this in a branch. We got a couple of patches in earlier from Wangda Tan and then the feature was unfortunately dropped on the floor. A branch helps avoid this going forward.

          MENG DING made changes -
          Attachment YARN-1197_Design.pdf [ 12735345 ]
          MENG DING added a comment -

          We have come up with a new proposal to address the above problem, which we believe makes more sense.

          In essence, when RM receives a valid resource decrease message for a container, it should go ahead and honor it directly. If later on an increase message comes for the same container, and the target resource allocation is different from the current resource allocation known by RM, this increase action should be cancelled. To cancel this increase action, RM can simply use the NM-RM node heartbeat response to send the current resource allocation of this container known by RM back to NM. NM can use this value to monitor and enforce the resource usage of the container. In fact, we propose that after RM processes the resource change messages from NM in each node update heartbeat, it should simply set the final resource allocation of those containers that have just been re-sized in the heartbeat response, so that NM will have the same view of the resource allocation of those changed containers as RM (a small sketch follows the example below).

          Back to the original example:

          1. A container is currently using 6G
          2. AM asks RM to increase it to 8G
          3. RM grants the increase request, allocates the resource to the container to 8G, and issues a token to AM. It starts a timer and remembers the original resource allocation before the increase as 6G.
          4. AM, instead of initiating the resource increase to NM, requests a resource decrease to NM to decrease it to 4G
          5. The decrease is successful and RM gets the notification, and updates the container resource to 4G
          6. Before the token expires, the AM requests the resource increase to NM
          7. RM receives the resource increase message (8G) from the node update. However, the current resource allocation of this container is 4G, which is different from 8G, so RM will NOT consider this increase valid. It unregisters the increase request from the timer, and sets the current resource allocation (4G) in the node heartbeat response, which will be pulled by NM in the next heartbeat cycle.
          8. Once NM receives the 4G resource allocation, it will monitor and enforce using the 4G value.
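
          A minimal sketch of the RM-side check in step 7, under the assumption of hypothetical helper names (the expirer and heartbeat-response methods below are illustrative, not the actual APIs):

          // Called while processing a container increase report from a node update.
          void onIncreaseReported(RMContainer container, Resource reportedTarget) {
            containerIncreaseExpirer.unregister(container.getContainerId()); // stop the expiration timer
            Resource current = container.getAllocatedResource();
            if (!reportedTarget.equals(current)) {
              // The container was resized (e.g. decreased) after the token was granted,
              // so the reported increase is stale: echo RM's current view back to the NM.
              heartbeatResponse.addContainerToUpdate(container.getContainerId(), current);
            }
            // otherwise the increase matches RM's bookkeeping and nothing else is needed
          }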

          We have finished updating the design doc and have attached it to this thread (YARN-1197_Design.pdf). Many thanks to Tan, Wangda for the original design doc, which really helped us to catch up with all the discussion as quickly as we can. We hope to get your valuable feedback soon.

          We think most of the sub-tasks are still good as outlined in the updated design doc. Once we get approval of the design from the community, we will start the implementation.

          MENG DING added a comment -

          So to summarize the current dilemma:

          Situation:

          • A container resource increase request has been granted, and a token has been issued to AM, and
          • The increase action has not been fulfilled, and the token is not expired yet

          Problem:

          • AM can initiate a container resource decrease action to NM, and NM will fulfill it and notify RM, and then
          • Before the token expires, AM can still initiate a container resource increase action on NM with the token, and NM will fulfill it and notify RM

          Proposed solution:

          • When RM receives a container decrease message from NM, it will first check if there is an outstanding container increase action (by checking the ContainerResourceIncreaseExpirer)
          • If the answer is no, RM will go ahead and update its internal resource bookkeeping and reduce the container resource allocation for this container.
          • If the answer is yes, RM will skip the resource reduction in this cycle, keep the resource decrease message in its newlyDecreasedContainers data structure, and check again in the next NM-RM heartbeat cycle.
          • If in the next heartbeat, a resource increase message to the same container comes, the previous resource decrease message will be dropped.

          Not sure if there is a better solution to this problem. Let me know if this makes sense or not.

          Thanks,
          Meng

          MENG DING added a comment -

          Well, I think I spoke too soon

          The example I gave above is not entirely correct:

          1. A container is currently using 6G
          2. AM asks RM to increase it to 8G
          3. RM grants the increase request, allocates the resource to the container to 8G, and issues a token to AM. It starts a timer and remembers the original resource allocation before the increase as 6G.
          4. AM, instead of initiating the resource increase to NM, requests a resource decrease to NM to decrease it to 4G
          5. The decrease is successful and RM gets the notification, and updates the container resource to 4G
          6. Before the token expires, the AM requests the resource increase to NM
          7. The increase is successful and RM gets the notification, and updates the container resource back to 8G

          Steps 6 and 7 should not be allowed because the RM has already reduced the container resource to 4G, which effectively invalidated the previously granted increase request (8G), even though the token has not yet expired.

          MENG DING added a comment -

          Wangda Tan I totally agree that Yarn should not mess with Java Xmx. Sorry for not being clear before.

          While digging into the design details of this issue, there is (I think) a piece that is missing from the original design doc, which I hope to get some insights/clarifications from the community:

          It seems there is no discussion about the container resource increase token expiration logic.

          Here is what I think that should happen:

          1. AM sends a container resource increase request to RM.
          2. RM grants the request, allocating the additional resource to the container, updating its internal resource bookkeeping.
          3. During the next AM-RM heartbeat, RM pulls the newly increased container, creates a token for it and sets the token in the allocation response. In the meantime, RM starts a timer for this granted increase (e.g., register with ContainerResourceIncreaseExpirer).
          4. AM acquires the container resource increase token from the heartbeat response, then calls the NMClient API to launch the container resource increase action on NM.
          5. NM receives the request, increases the monitoring quota of the container, and notifies the NodeStatusUpdater.
          6. The NodeStatusUpdater informs the increase success to RM during regular NM-RM heartbeat.
          7. Upon receiving the increase success message, the RM stops the timer (e.g, unregister with ContainerResourceIncreaseExpirer).

          If, however, the timer in RM expires and no increase success message has been received for this container, RM must release the increased resource granted to the container, and update its internal resource bookkeeping.

          As such, NM-RM heartbeat must also include container resource increase message (which doesn't exist in the original design), otherwise the expiration logic will not work.

          In addition, RM must remember the original resource allocation to the container (this info may be stored in the ContainerResourceIncreaseExpirer), because in the case of expiration, RM needs to release the increased resource and revert back to the original resource allocation. This is different from a newly allocated container, in which case, RM simply needs to release the resource for the entire container when it expires.
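
          A minimal sketch of that bookkeeping, with illustrative names (the rollback hook in particular is hypothetical):

          // Remember the allocation at the time the increase token is handed out ...
          private final Map<ContainerId, Resource> allocationBeforeIncrease = new ConcurrentHashMap<>();

          void onIncreaseAcquired(ContainerId id, Resource currentAllocation) {
            allocationBeforeIncrease.put(id, currentAllocation);
          }

          // ... and revert to it if the NM never confirms the increase before expiration.
          void onIncreaseExpired(ContainerId id) {
            Resource original = allocationBeforeIncrease.remove(id);
            if (original != null) {
              revertContainerAllocation(id, original); // hypothetical scheduler hook
            }
          }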

          To make matters more complicated, after a container resource increase token has been given out, and before it expires, there is no guarantee that AM won't issue a resource decrease action on the same container. Because the resource decrease action starts from NM, NM has no idea that a resource increase token on the same container has been issued, and that a resource increase action could happen anytime.

          Given the above, here is what I propose to simplify things as much as we can without compromising the main functionality:

          At the RM side
          1. During each scheduling, if RM finds that there are still granted container resource increase sitting in RM (i.e., not yet acquired by AM), it will skip scheduling any outstanding resource increase request to the same container.
          2. During each scheduling, if RM finds that there is a granted container resource increase registered with ContainerResourceIncreaseExpirer, it will skip scheduling any outstanding resource increase request to the same container.

          This will guarantee that at any time, there can be one and only one resource increase request for a container.

          At the NM side
          1. Create a map to track any resource increase or decrease action for a container in NMContext. At any time, there can only be either an increase action or a decrease action going on for a specific container. While an increase/decrease action is in progress in NM, any new request from AM to increase/decrease resource to the same container will be rejected (with proper error messages).
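
          A minimal sketch of such a guard in NMContext (the enum and method names are illustrative, not from any posted patch):

          enum ResourceChangeAction { INCREASE, DECREASE }

          // At most one resize action may be in flight per container.
          private final ConcurrentMap<ContainerId, ResourceChangeAction> changeInProgress =
              new ConcurrentHashMap<>();

          boolean tryStartChange(ContainerId id, ResourceChangeAction action) {
            // putIfAbsent returns null only when no other change is currently in progress
            return changeInProgress.putIfAbsent(id, action) == null;
          }

          void finishChange(ContainerId id) {
            changeInProgress.remove(id);
          }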

          With the above logic, here is an example of what could happen:

          1. A container is currently using 6G
          2. AM asks RM to increase it to 8G
          3. RM grants the increase request, allocates the resource to the container to 8G, and issues a token to AM. It starts a timer and remembers the original resource allocation before the increase as 6G.
          4. AM, instead of initiating the resource increase to NM, requests a resource decrease to NM to decrease it to 4G
          5. The decrease is successful and RM gets the notification, and updates the container resource to 4G

          After this, two possible sequences may occur:

          6. Before the token expires, the AM requests the resource increase to NM
          7. The increase is successful and RM gets the notification, and updates the container resource back to 8G

          Or,

          6. AM never sends the resource increase to NM
          7. The token expires in RM. RM attempts to revert the resource increase (i.e., set the resource allocation back to 6G), but seeing that it is currently using 4G, it will do nothing.

          This is what I have for now. Sorry for the long email, and I hope that I have made myself clear. Please let me know if I am on the right track. Any comments/corrections/suggestions/advice will be extremely appreciated.

          Meng

          Wangda Tan added a comment -

          I agree with what Karthik Kambatla mentioned; increasing Xmx doesn't seem like a very good idea. I think we should treat the JVM as a black box and not try to hack it from YARN's perspective. It's fine if the user's application does the Xmx adjustment itself to make it shrinkable.

          The reason why only CPU enforcement is supported is that CPU enforcement in LCE is a soft limit, while memory is a hard limit that can lead to process failure when a memory spike happens; you can check the YARN-3 discussion for more details. YARN-2793 is different: it tries to define the behavior of killing an over-used container, not how to enforce the limit.

          Dynamically updating a cgroup is also not supported, but I think we should support it with this ticket.
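
          For reference, a dynamic cgroup update is essentially a rewrite of the container's limit file. A minimal sketch, assuming cgroups v1 and the default hadoop-yarn hierarchy, and assuming a memory hierarchy were actually mounted for containers (which, as noted above, LCE does not set up today):

          // Rewrite the memory limit of a running container's cgroup (cgroups v1).
          void updateMemoryLimit(String containerId, long newLimitBytes) throws IOException {
            java.nio.file.Path limitFile = java.nio.file.Paths.get(
                "/sys/fs/cgroup/memory/hadoop-yarn/" + containerId + "/memory.limit_in_bytes");
            java.nio.file.Files.write(limitFile,
                Long.toString(newLimitBytes).getBytes(java.nio.charset.StandardCharsets.UTF_8));
          }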

          MENG DING added a comment -

          Thanks guys for the comments.

          Yes, I believe a memory enforcement option for LCE is definitely a desirable feature and the proper way to handle memory enforcement for long running services. It looks like YARN-2793 is related, and YARN-3 already has a patch for this?

          Then we also need the capability to dynamically update the cgroup that a process runs under, which I believe is not supported today either, right?

          Karthik Kambatla added a comment -

          We thought about launching the JVM based container with -Xmx set to the physical memory of the node, and use cgroup memory control to enforce the resource limit, but we don't think LCE supports memory isolation right now . We cannot use YARN's default memory enforcement as we don't want long running services to be killed.

          A JVM with a larger value for Xmx will likely be less aggressive with GC. Any resultant increase in heap size might or might not be a good thing. If you think this is something viable that people care about, we could consider adding a memory-enforcement option to LCE.

          Wangda Tan added a comment -

          MENG DING,
          Thanks for your interest in this ticket, some comments:

          For JVM based containers (e.g., container running HBase), it is not possible right now to change the heap size of JVM without restarting the Java process. Even if we can implement a wrapper in the container to relaunch a Java process when resource is changed for a container, we still need to implement an interface between node manager and container to trigger the relaunch action.

          Good point, this is one thing we noted as well. I don't think there's any easy solution for shrinking a JVM. Relaunching the container could be one method, but it will be hard to make a generic "container wrapper" since kill-and-relaunch will lose the data in memory.

          But since shrinking memory is a proactive action, when a process wants to shrink its resources, it can use its own "container wrapper" to relaunch the process if it has some data recovery mechanism.

          MENG DING added a comment -

          We have a real use case to better support long running services on YARN, and to share resources between long running services and batch jobs. We have carefully reviewed the discussions and documentation in this thread (and other topics related to this thread), and are committed to bringing this work to completion. We agree with the general design of this feature, and understand that it is the result of extensive discussions among many experts.

          We will attempt to post an updated design shortly for review.

          We don't really see a bottleneck at the scheduler side at this moment. However, we do see problems with memory enforcement for long running services.

          • For JVM-based containers (e.g., a container running HBase), it is not possible right now to change the heap size of a JVM without restarting the Java process. Even if we implement a wrapper in the container to relaunch the Java process when the container's resource is changed, we still need an interface between the node manager and the container to trigger the relaunch.
          • We thought about launching a JVM-based container with -Xmx set to the physical memory of the node and using cgroup memory control to enforce the resource limit, but we don't think LCE supports memory isolation right now. We cannot use YARN's default memory enforcement because we don't want long-running services to be killed.

          So overall there doesn't seem to be an easy way to enforce memory limits without killing long-running services right now. Any comments or suggestions will be greatly appreciated.

          Thanks,
          Meng

          Jeff Zhang made changes -
          Assignee Jeff Zhang [ zjffdu ]
          Hide
          Wangda Tan added a comment -

          I'm glad if anybody can continue this work.

          Since it will potentially take a huge effort to complete, I think we need the following to keep this moving:

          • High-level discussion: since the design doc, patches and tasks were created about a year ago, some of them need rethinking/amendment and some are completely stale. For example, are there any actual use cases for this JIRA? Do any downstream projects plan to consume these APIs? Is there an alternative ticket with a similar proposal (like YARN-1488)?
          • Implementation: after the high-level discussion, I think we can plan the implementation; there are a bunch of sub-tasks I created for this ticket. If you're interested in any of them, just let me know and I can assign them to you. If you don't agree with the task coverage/granularity, please feel free to create a new ticket and I can close the original one.

          And I think the umbrella JIRA should be left unassigned. Jeff Zhang, if you're not working on it, could you unassign it?

          Thoughts?

          Thanks,
          Wangda

          Hide
          Tsuyoshi Ozawa added a comment -

          Wangda Tan, could you review your tasks and unassign the ones you haven't started working on? This feature is useful for some YARN applications (e.g. Spark and Tez).

          Hide
          Chen He added a comment -

          I can also contribute some time on this JIRA.

          Jeff Hammerbacher made changes -
          Link This issue relates to SPARK-3174 [ SPARK-3174 ]
          Hide
          Karthik Kambatla (Inactive) added a comment -

          I am interested in working on this. Depending on the progress, I'll be glad to write patches or review them.

          Jeff Zhang made changes -
          Assignee Jeff Zhang [ zjffdu ]
          Hide
          Wangda Tan (No longer used) added a comment -

          I'm leaving my current company next week and am no longer involved in YARN-1197; one of my colleagues will take over this JIRA and its sub-tasks.

          Wangda Tan (No longer used) made changes -
          Assignee Wangda Tan [ gp.leftnoteasy ]
          Hide
          Wangda Tan added a comment -

          I attached patches for the NM-side changes: YARN-1643, YARN-1644, YARN-1645. Can someone give them a review? Thanks!
          Bikas Saha Sandy Ryza Vinod Kumar Vavilapalli

          Hide
          Wangda Tan added a comment -

          I created a bunch of sub-tasks for easier review:
          NM-side changes are YARN-1643, YARN-1644 and YARN-1645, instead of the big task YARN-1449; I'll work on them first.
          RM-side changes are YARN-1646, YARN-1647, YARN-1648, YARN-1649, YARN-1650, YARN-1651, YARN-1652, YARN-1653 and YARN-1654, instead of the big task YARN-1502. These sub-tasks will add container resource change support to the capacity scheduler and change the corresponding implementations on the RM side; I'll break down the existing patch in YARN-1502 and submit it after the NM changes are completed.
          YARN-1655 will add support to the fair scheduler; I plan to work on this after the capacity scheduler changes are completed.

          Hide
          Wangda Tan (No longer used) added a comment -

          Thanks Sandy, I'm working on breaking down the RM patch and will create sub-JIRAs for easier review.

          Hide
          Sandy Ryza added a comment -

          Created a YARN-1197 branch. Will revert the commits in trunk and branch-2 soon.

          Hide
          Arun C Murthy added a comment -

          I'd appreciate that, thanks Sandy Ryza!

          Hide
          Sandy Ryza added a comment -

          Arun C Murthy, any progress on the branch? If not, I'd be happy to take care of it.

          Hide
          Wangda Tan (No longer used) added a comment -

          Arun C Murthy, great, thanks

          Hide
          Arun C Murthy added a comment -

          Thanks Tan, Wangda, glad we agree. I'll prepare a branch and move the commits there. Thanks again for your contributions and for being so flexible.

          Hide
          Wangda Tan (No longer used) added a comment -

          Copying the text from the scheduler design doc here for easier discussion; please feel free to let me know your comments!

          Basic Requirements
          We need to support handling resource increase requests from the AM and resource decrease notifications from the NM:

          • Such resource changes should be reflected in FiCaSchedulerNode/App, LeafQueue and ParentQueue (e.g. usedResource, reservedResource, etc.)
          • If a user submits an increase request that cannot be satisfied immediately, it will be reserved in the node/app (node/app means FiCaSchedulerApp/Node, same below), as before.

          Advanced Requirements

          • We need to gracefully handle race conditions (a small bookkeeping sketch follows this list):
            • Only acquired/running containers can be increased
            • A container decrease only takes effect on acquired/running containers. (If a container is finished/killed, etc., all of its resource is released, so we don't need to decrease it)
            • A user may submit a new increase request for a container while a pending increase request for the same container exists. We need to replace the pending request with the new one.
            • When the requested container resource is less than or equal to the existing container resource:
              • The request is ignored if there is no pending increase request for this container
              • Otherwise the request is ignored and the pending increase request is canceled
            • When a pending increase request exists and a decrease notification arrives for the same container, the container is decreased and the pending increase request is canceled
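
          An illustrative sketch of the bookkeeping rules above (not taken from any patch attached here); the class name, the per-container map, and the use of a plain memory value in place of a Resource object are assumptions made only for illustration.

              import java.util.HashMap;
              import java.util.Map;

              public class PendingIncreaseRequests {
                  // containerId -> requested memory in MB (stand-in for a full Resource object).
                  private final Map<String, Long> pending = new HashMap<>();

                  // Apply an AM increase ask against the container's current size.
                  public synchronized void onIncreaseAsk(String containerId, long askedMb, long currentMb) {
                      if (askedMb <= currentMb) {
                          // Too small: ignore the ask and cancel any pending increase request.
                          pending.remove(containerId);
                          return;
                      }
                      // A newer ask replaces whatever was pending for the same container.
                      pending.put(containerId, askedMb);
                  }

                  // A decrease notification from the NM cancels any pending increase.
                  public synchronized void onDecrease(String containerId) {
                      pending.remove(containerId);
                  }

                  public synchronized Long getPending(String containerId) {
                      return pending.get(containerId);
                  }
              }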

          Requirements not clear

          • Do we need a time-out parameter for a reserved resource increase request, to avoid it occupying the node's resources for too long? (Do we have such a parameter for reserving a "normal" container in CS?)
          • How do we decide whether an increase request or a normal container request is satisfied first? (Currently, I simply make CS satisfy increase requests first.) Should this be a configurable parameter?

          Current Implementation

          1) Decrease Container
          I start with container decrease because it is easier to understand.
          Decreased containers will be handled in nodeUpdate() of the capacity scheduler.
          When CS receives decreased containers from the NM, it processes them one by one with the following steps (a rough sketch follows the list):

          • Check if the container is in the RUNNING state (because this is reported by the NM, its state will be either running or completed); skip it if not.
          • Remove any increase request on the same container id, if one exists
          • Decrease/update the container resource in FiCaSchedulerApp/AppSchedulingInfo/FiCaSchedulerNode/LeafQueue/ParentQueue/other related metrics
          • Update the resource in the Container object.
          • Return the decreased container to the AM by calling setDecreasedContainer in AllocateResponse
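
          A rough sketch of the decrease path above, driven from the scheduler's node update; the types are simplified stand-ins, and the queue/app/node accounting and token updates are left as stubs rather than the real YARN classes.

              import java.util.List;

              public class DecreaseHandler {
                  static class DecreasedContainer {
                      String containerId;
                      long newMemoryMb;
                      boolean running;
                  }

                  public void onNodeUpdate(List<DecreasedContainer> decreasedFromNm) {
                      for (DecreasedContainer c : decreasedFromNm) {
                          if (!c.running) {
                              continue; // finished/killed containers have already released everything
                          }
                          cancelPendingIncrease(c.containerId);  // drop any pending increase request for this container
                          updateSchedulerAccounting(c);          // app/node/queue usedResource and metrics
                          updateContainerResource(c);            // the Container object itself
                          addToAllocateResponse(c);              // report the decreased container back to the AM
                      }
                  }

                  private void cancelPendingIncrease(String containerId) { /* omitted */ }
                  private void updateSchedulerAccounting(DecreasedContainer c) { /* omitted */ }
                  private void updateContainerResource(DecreasedContainer c) { /* omitted */ }
                  private void addToAllocateResponse(DecreasedContainer c) { /* omitted */ }
              }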

          2) Increase Container
          Increasing a container is much more complex than decreasing one.

          Steps to add a container increase request (pseudocode)
          In CapacityScheduler.allocate(...)

              foreach (increase_request):
                  if (state != ACQUIRED) and (state != RUNNING):
                      continue;
          
                  // Remove the old request on the same container-id if it exists
                  if increase_request_exist(increase_request.getContainerId()):
                      remove(increaseRequest);
          
                  // Asked target resource should be larger than the existing resource
                  if increase_request.ask_resource <= 
                  existing_resource(increase_request.getContainerId()):
                      continue;
          
                  // Add it to application
                  getApplication(increase_request.getContainerId()).add(increase_request)
          

          Steps to handle container increase request,
          2.1) In CapacityScheduler.nodeUpdate(...):

              if node.is_reserved():
                  if reserved-increase-request:
                      LeafQueue.assignReservedIncreaseRequest(...)
                  elif reserved-normal-container:
                      ...
              else:
                  ParentQueue.assignContainers(...)
                  // this will finally call 
                  // LeafQueue.assignContainers(...)
          

          2.2) In CapacityScheduler.nodeUpdate(...):

              if request-is-fit-in-resource:
                  allocate-resource
                  update container token
                  add to AllocateResponse
                  return allocated-resource
              else:
                  return None
          

          2.3) In LeafQueue.assignContainers(...):

              foreach (application):
                  // do increase allocation first
                  foreach (increase_request):
                      // check if we can allocate it
                      // in queue/user limits, etc.
                      // return None if not satisfied
          
                      if request-is-fit-in-resource:
                          allocate-resource
                          update container token
                          add to AllocateResponse
                      else:
                          reserve in app/node
                          return reserved-resource
          
                  // do normal allocation
                  ...
          

          API changes in CapacityScheduler
          1) YarnScheduler

             public Allocation allocate(ApplicationAttemptId applicationAttemptId,
                 List<ResourceRequest> ask, List<ContainerId> release,
                 List<String> blacklistAdditions, List<String> blacklistRemovals,
          +    List<ContainerResourceIncreaseRequest> increaseRequests)
          

          2) CSQueue

          +  public void cancelIncreaseRequestReservation(Resource clusterResource,
          +      ContainerResourceIncreaseRequest changeRequest, Resource required);
          +  
          +  public void decreaseContainerResource(FiCaSchedulerApp application,
          +      Resource clusterResource, Resource released);
          

          3) FiCaSchedulerApp

          +  synchronized public List<ContainerResourceIncreaseRequest>
          +      getResourceIncreaseRequest(NodeId nodeId)
          +
          +  synchronized public ContainerResourceIncreaseRequest
          +      getResourceIncreaseRequest(NodeId nodeId, ContainerId containerId);
          +
          +  synchronized public void removeIncreaseRequest(NodeId nodeId,
          +      ContainerId containerId);
          +
          +  synchronized public void decreaseContainerResource(Resource released);
          

          4) FiCaSchedulerNode

          + public synchronized void increaseResource(Resource resource);
          + public synchronized void decreaseContainerResource(Resource resource);
          +  public synchronized void reserveIncreaseResource(
          +      ContainerResourceIncreaseRequest increaseRequest);
          +  public synchronized void unreserveIncreaseResource(ContainerId containerId)
          

          5) AppSchedulingInfo

          +  synchronized public void addIncreaseRequests(NodeId nodeId,
          +      ContainerResourceIncreaseRequest increaseRequest, Resource required)
          +  synchronized public void decreaseContainerResource(Resource released)
          +  synchronized public List<ContainerResourceIncreaseRequest>
          +      getResourceIncreaseRequests(NodeId nodeId)
          +  synchronized public ContainerResourceIncreaseRequest
          +      getResourceIncreaseRequests(NodeId nodeId, ContainerId containerId)
          +  synchronized public void allocateIncreaseRequest(Resource required)
          +  synchronized public void removeIncreaseRequest(NodeId nodeId, 
          +      ContainerId containerId)
          
          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197-scheduler-v1.pdf [ 12618425 ]
          Hide
          Wangda Tan (No longer used) added a comment -

          Attached the scheduler design doc for increasing and decreasing; I've uploaded a rough preview patch for the scheduler changes in YARN-1502

          Hide
          Wangda Tan (No longer used) added a comment -

          My thoughts after reading Bikas and Sandy's comments:
          I think the two JIRAs are tackling different problems, and as Sandy said, container increase/decrease gives the scheduler more input to optimize such increase requests (for example, a user could configure "increase request priority" higher than "normal container request priority", so increase operations are handled faster to support low-latency services).
          And if a user wants to use container delegation instead of container increase just to get more dynamic resources (not shared between users/applications), I have some concerns beyond Sandy's points.

          • If we treat all delegated resources as the same container, why not simply merge them (as I originally proposed in this JIRA)? Merging them will make it easier for us to manage preemption, etc.
          • How do we deal with merging a running container (a container is already running; what should happen if we delegate its resources to another container)? If we can only delegate a container in the ACQUIRED state, we need to deal with the timeout for launching a container before it is taken back by the RM
          • Can we do container "de-delegation"?

          In general, I support using container increase/decrease to handle resource changes within a single application.

          Hide
          Bikas Saha added a comment -

          I am afraid we seem to be mixing things up here. This jira deals with the issue of increasing and decreasing the resources of an allocated container. There are clear use cases for it, as mentioned in previous comments on this jira, e.g. having a long running worker daemon increase and decrease its resources depending on load. We have already discussed at length on this jira how increasing a container's resource is internally no different from requesting an additional container and merging it with an existing container. However, the new container followed by a merge is far more complicated for the user and adds additional complexity to the system (e.g. how to deal with the new container that was merged into the old one). This complexity is in addition to the work it has in common with simply increasing resources on a given container. With respect to the user, asking for a container and being able to increase its resources will give the same effect as asking for many containers and merging them.
          The scenario for YARN-1488 is logically different. That covers the case when an app wants to use a shared service and purchases that service by transferring its own container resource to that shared service that itself is running inside YARN. The consumer app may never need to increase its own container resource. Secondly, the shared service is not requesting an increase in its own container resources. So this jira does not come into the picture at all.

          I believe we have a clear and cleanly separated piece of useful functionality being implemented in this jira. We should go ahead and bring this work to completion and facilitate the creation of long running services in YARN.

          With respect to doing this in a branch: there are new APIs being added here for which functionality does not exist or is not supported yet, and none of that code will get executed until clients actually support doing it or someone writes code against it. So I don't think any of this is going to destabilize the code base. I agree that the scheduler changes are going to be complicated. We can do them at the end, when all the plumbing is in place, and they could be separate jiras for each scheduler. Of course, schedulers would want their own flags to turn this on/off. So it's not clear to me what benefits a branch would bring here, but it would entail the overhead of maintenance and a lack of test automation. Does this mean that every feature addition to YARN needs to be done in a branch? I propose we do this work in trunk and later merge it into branch-2 when we are satisfied with its stability.

          Hide
          Sandy Ryza added a comment -

          Wanted to post some thoughts on this vs. YARN-1488. YARN-1488 proposes that if you receive a container on a node you should be able to delegate it to another container already running on that node, essentially adding the resources of the container you received to the allocation of the running container. This sounds a lot like a resource increase. The differences are that:

          • With the mechanism proposed in this JIRA, the request is explicitly an increase and mentions the container you want to add it to. This allows the scheduler to use special logic for handling increase requests.
          • With the mechanism proposed on YARN-1488, a container can be used to increase the resources of a container from another application.
          • With the mechanism proposed here, after satisfying the increase request, the scheduler is tracking a single larger container, not multiple small ones.

          An advantage of treating an increase request the same as a regular container request is that an application could submit it at the same time. I.e. if I want a single container with as many resources as possible on node X, I can request a number of containers on that node, wait for some time period for allocations to accrue, and then run them all as a single container.

          I think the deciding factor might be how preemption functions here. I.e. what is the preemptable unit - can we preempt a part of a container? Have some thoughts but will try to organize them more before posting here.

          Hide
          Wangda Tan (No longer used) added a comment -

          Sorry I missed the reply from Sandy :-p

          Hide
          Wangda Tan (No longer used) added a comment -

          Hi Vinod,
          Thanks for jumping in; my thoughts on your questions:

          Seems like the control flow is asymmetrical for resource decrease. We directly go to the node first. Is that intended? On first look, that seems fine - decreasing resource usage on a node is akin to killing a container by talking to NM directly.

          Yes, we discussed this in this JIRA (credit to Bikas, Sandy and Tucu); I think decreasing a resource is a similar operation to killing a container

          In such applications that decrease container-resource, will the application first instruct its container to reduce the resource usage and then inform the platform? The reason this is important is that if it doesn't happen that way, the node will either forcefully kill it when monitoring resource usage or change its cgroup immediately, causing the container to swap.

          Yes, I think the AM should notify the NM about this only after it makes sure the resource usage in the container has already been reduced, to avoid the container being killed by the NM.

          I support moving this to a branch so it can be nicely completed before merging it to trunk.

          Hide
          Sandy Ryza added a comment -

          Seems like the control flow is asymmetrical for resource decrease. We directly go to the node first. Is that intended? On first look, that seems fine - decreasing resource usage on a node is akin to killing a container by talking to NM directly.

          This is intentional - we went through a few different flows before settling on this approach. The analogy with killing the container was one of the reasons for this.

          In such applications that decrease container-resource, will the application first instruct its container to reduce the resource usage and then inform the platform? The reason this is important is that if it doesn't happen that way, the node will either forcefully kill it when monitoring resource usage or change its cgroup immediately, causing the container to swap.

          When reducing memory, the application should inform the container process before informing the NodeManager. When only reducing CPU, there will probably be situations where only informing the platform is necessary.

          To avoid branch-rot, we could target a subset, say just the resource-increase changes in the branch and do the remaining work on trunk after merge.

          Sounds reasonable to me.
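
          A hedged sketch of the memory-decrease ordering discussed above, seen from the AM side; the method names are placeholders rather than real YARN client APIs, and only the sequencing is the point.

              public class DecreaseOrderingSketch {
                  void decrease(String containerId, Long newMemoryMb, Integer newVcores) {
                      if (newMemoryMb != null) {
                          // 1. Ask the container process to shrink its usage first (application-specific).
                          askContainerToShrinkMemory(containerId, newMemoryMb);
                      }
                      // 2. Only then report the smaller size so the NM tightens monitoring/cgroups afterwards.
                      //    A CPU-only decrease may be able to skip step 1 and go straight to the platform.
                      notifyNodeManager(containerId, newMemoryMb, newVcores);
                  }

                  void askContainerToShrinkMemory(String containerId, long memoryMb) { /* application-specific */ }
                  void notifyNodeManager(String containerId, Long memoryMb, Integer vcores) { /* AM-NM protocol call */ }
              }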

          Hide
          Vinod Kumar Vavilapalli added a comment -

          I just caught up with this. Well written document, thanks! Some questions:

          • Decreasing resources:
            • Seems like the control flow is asymmetrical for resource decrease. We directly go to the node first. Is that intended? On first look, that seems fine - decreasing resource usage on a node is akin to killing a container by talking to NM directly.
            • In such applications that decrease container-resource, will the application first instruct its container to reduce the resource usage and then inform the platform? The reason this is important is that if it doesn't happen that way, the node will either forcefully kill it when monitoring resource usage or change its cgroup immediately, causing the container to swap.

          Also, I can see that some of the scheduler changes are going to be pretty involved. I'd also vote for a branch. A couple of patches already went in and I'm not even sure we got them right and/or whether they need more revisions as we start making core changes. To avoid branch-rot, we could target a subset, say just the resource-increase changes, in the branch and do the remaining work on trunk after merge.

          Hide
          Wangda Tan (No longer used) added a comment -

          Agreed. I also think the scheduler part needs some time for review; I'll create a JIRA for the scheduler part and upload the patch (updated against YARN-1447 and YARN-1448) and design doc ASAP.

          Hide
          Arun C Murthy added a comment -

          The problem is that we can't ship half of this feature in 2.4 - it's either in or out. So, a branch would be significantly better - it's either in or out for 2.4.

          Hide
          Bikas Saha added a comment -

          There are some plumbing/infra related changes which we could commit to trunk safely. None of that would be executed until some scheduler actually supports this. When that happens we could decide to move the code to branch-2 to target a release. Would prefer that to a branch which would need maintenance.

          Hide
          Arun C Murthy added a comment -

          Sorry, to come in late - I'm +1 for the overall idea/approach.

          However, I feel we still have to work through details on the scheduler side. So, I'd like to see this developed in a branch. This would allow for a full picture to emerge before we commit it to a specific release 2.4 v/s 2.5 etc. Thoughts?

          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197-v5.pdf [ 12617834 ]
          Hide
          Wangda Tan (No longer used) added a comment -

          I attached an updated design doc according to our discussion.

          Hide
          Wangda Tan (No longer used) added a comment -

          Guys, I attached patches for YARN-1448 and YARN-1449 for review (updated according to YARN-1447); hope to get your thoughts.
          Thanks

          Hide
          Wangda Tan (No longer used) added a comment -

          +1 for Junping's idea.
          I'm not sure we can get the low-latency container increases needed by many applications via resource over-commitment.

          Hide
          Junping Du added a comment -

          I think this is related to YARN-1011, which would provide resource over-commitment in YARN and achieve better resource elasticity.

          Junping Du made changes -
          Link This issue relates to YARN-1011 [ YARN-1011 ]
          Wangda Tan (No longer used) made changes -
          Description The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource
          of an allocated container the only way is releasing it and allocating a new container with expected size.
          Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side
          Wangda Tan (No longer used) made changes -
          Description Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments.
          Hide
          Wangda Tan (No longer used) added a comment -

          Got it, I'll let you know if I have any questions

          Hide
          Bikas Saha added a comment -

          You can make them sub-tasks of this jira. Use More->Create-sub-tasks from the menu items on top.

          Hide
          Wangda Tan (No longer used) added a comment -

          OK, I'll create a series of JIRAs for easier review/commit, thanks for your advice!

          Hide
          Sandy Ryza added a comment -

          +1 to what Bikas said. Would also be good to have a separate JIRA for the scheduler changes.

          Hide
          Bikas Saha added a comment -

          This is great. However, for efficient review/commit we will need to break these down into different jiras and attach patches to them. That will force us to look at each change by itself (using this jira for context) and more importantly, make sure each logically distinct piece is complete in itself. It should have its own tests, and pass Jenkins build/test independently so that it can be committed independently.
          Can you please do that and determine the order in which those jiras should be reviewed/committed? E.g. the API and protobuf changes should be a separate jira for the RM and NM protocols. That should probably be the first jira we review. And so on.

          Wangda Tan (No longer used) made changes -
          Attachment mapreduce-project.patch.ver.1 [ 12615527 ]
          Attachment tools-project.patch.ver.1 [ 12615528 ]
          Attachment yarn-api-protocol.patch.ver.1 [ 12615529 ]
          Attachment yarn-pb-impl.patch.ver.1 [ 12615530 ]
          Attachment yarn-server-common.patch.ver.1 [ 12615531 ]
          Attachment yarn-server-nodemanager.patch.ver.1 [ 12615532 ]
          Attachment yarn-server-resourcemanager.patch.ver.1 [ 12615533 ]
          Hide
          Wangda Tan (No longer used) added a comment -

          I just finished container resource increase support, including PB/API changes, making the capacity scheduler support increases, and making the NM support changing the monitored size of a running container.

          I split it into several patches for easier review:

          • API/pb file changes in hadoop-yarn-api
          • PB implementations in hadoop-yarn-common
          • yarn-server-common changes
          • yarn-server-resourcemanager changes, including the capacity scheduler, the AMS, etc.
          • yarn-server-nodemanager changes include ContainerManagerImpl and ContainersMonitor changes
          • other related project changes according to the updated APIs (map-reduce/tools)

          The above are preview patches and still very rough. Bikas Saha, Sandy Ryza, Alejandro Abdelnur, Vinod Kumar Vavilapalli, could you please review them? I'm eager for your ideas!

          And some short notes on the current RM/NM implementation not covered in the design doc:
          1) Implementation in the capacity scheduler for increasing a container's size
          It's very close to allocating a new container; some details:

          • An increase request is only valid when the asked size is larger than the existing resource and the container state is either RUNNING or ACQUIRED
          • The entry point for increase request allocation is still CapacityScheduler:nodeUpdate()
          • When an increase request cannot be allocated, it will also be reserved. Each node can reserve at most one request (an increase request or a new container request). I created a new method isReserved() in FiCaSchedulerNode so the scheduler/queue can identify whether a node is reserved
          • The main logic for increase request allocation is also placed in LeafQueue:assignContainers; increase requests are processed before new container requests.
          • Queue (leaf/parent) capacity and user capacity checks are also done before reserving or allocating an increase request
          • Queue (leaf/parent) used resource will also be deducted when an increase request is reserved
          • Users may submit increase requests several times for the same container with different sizes.
            • If the asked size is equal to the previously asked size, the request is ignored
            • If the asked size is smaller than or equal to the existing size, the increase request on this container is canceled
            • If the asked size differs from the previously asked size and is greater than the existing size, it replaces the previous ask and cancels the previous reservation (if one existed).

          2) Implementation in node manager for increasing a container size

          • It performs similar checks (token verification, etc.) to starting a container
          • The increase logic is only valid when the ContainerState (the internal ContainerState: org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerState) is RUNNING, to avoid a race condition.
          • ContainersMonitorImpl will put change requests into containersToBeChanged when it receives a CHANGE_MONITORING_CONTAINER event, and they will be processed in MonitoringThread:run(); a rough sketch of this flow follows.
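
          A rough sketch of that ContainersMonitor flow, with simplified stand-ins for the real ContainersMonitorImpl internals; the queue and limit map below are assumptions for illustration, not the actual implementation.

              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;
              import java.util.concurrent.ConcurrentLinkedQueue;

              public class MonitorResizeSketch {
                  static class ChangeRequest {
                      final String containerId;
                      final long newMemLimitBytes;
                      ChangeRequest(String id, long limit) { containerId = id; newMemLimitBytes = limit; }
                  }

                  private final ConcurrentLinkedQueue<ChangeRequest> containersToBeChanged =
                      new ConcurrentLinkedQueue<>();
                  private final Map<String, Long> enforcedLimit = new ConcurrentHashMap<>();

                  // Called on a CHANGE_MONITORING_CONTAINER-style event from the dispatcher.
                  public void onChangeEvent(String containerId, long newLimitBytes) {
                      containersToBeChanged.add(new ChangeRequest(containerId, newLimitBytes));
                  }

                  // One iteration of the monitoring loop: apply pending changes, then check usage.
                  void monitorOnce(Map<String, Long> observedUsageBytes) {
                      ChangeRequest req;
                      while ((req = containersToBeChanged.poll()) != null) {
                          enforcedLimit.put(req.containerId, req.newMemLimitBytes);
                      }
                      for (Map.Entry<String, Long> e : observedUsageBytes.entrySet()) {
                          Long limit = enforcedLimit.get(e.getKey());
                          if (limit != null && e.getValue() > limit) {
                              // In the real monitor this is where an over-limit container would be flagged.
                              System.out.println("Container " + e.getKey() + " is over its (possibly resized) limit");
                          }
                      }
                  }
              }
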
          Hide
          Wangda Tan (No longer used) added a comment -

          Sandy Ryza
          This bunch of code will be hard to review, so I will make and upload an all-in-one draft patch for early preview before splitting it up. I hope I can finish the alpha patch this weekend :-]
          I agree that YARN enforces memory via monitoring in ContainersMonitor. I asked because I noticed that the design doc of YARN-3, https://issues.apache.org/jira/secure/attachment/12538426/mapreduce-4334-design-doc-v2.txt, mentions in "Design 2(b)" a "ResoureEnforcer.init()" before creating cgroups. I found this is not done yet, so for now we can safely update resource enforcement by updating the limits in ContainersMonitor.

          Sandy Ryza added a comment -

          +1 on the current design. If possible it would be best to split this up between a couple JIRAs - one for the increase and one for the decrease. We could do a third for the common proto definitions if necessary.

          I believe that YARN does memory enforcement in the same way whether or not the LCE is used. I.e. it monitors the container process and kills it if its memory consumption goes over the configured limit.
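          As a rough illustration of that monitor-and-kill style of enforcement, here is a minimal Java sketch; the usage probe and kill callback are placeholders, not the actual ContainersMonitor API.

          import java.util.Map;

          public class MemoryEnforcementSketch {
            interface UsageProbe { long memoryBytesOf(String containerId); }
            interface Killer { void kill(String containerId, String reason); }

            /** One monitoring pass: kill any container whose usage exceeds its configured limit. */
            static void enforce(Map<String, Long> limitsBytes, UsageProbe probe, Killer killer) {
              for (Map.Entry<String, Long> e : limitsBytes.entrySet()) {
                long used = probe.memoryBytesOf(e.getKey());
                if (used > e.getValue()) {
                  killer.kill(e.getKey(), "memory " + used + " bytes over limit " + e.getValue());
                }
              }
            }
          }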

          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197-v4.pdf [ 12614261 ]
          Wangda Tan (No longer used) added a comment -

          I updated the design doc mainly based on Bikas Saha's latest comment.
          I left "Add implementation to NM to support changing resource isolation limitation of a container" blank in the design doc. From my understanding, LCE currently doesn't do any memory isolation (right?); it only sets up the CPU share when it sets up a container. I don't yet have the whole picture of these components. The solutions I can think of are:
          1) Because this is a soft limitation, we just don't do anything on the LCE side
          2) Create a new interface in ContainerExecutor that can update the resources of a container
          3) Disable increase/shrink when LCE is selected
          Hoping to get more of your opinions on this and the design doc.

          Wangda Tan (No longer used) added a comment -

          Bikas Saha
          Actually I was halfway through implementing these and got pulled away by other work. :-/

          On putting the increase request into ResourceRequest:
          I agree. I spent some time (the half-baked scheduler supporting increase) proving that putting the increase request into the resource request is NOT good, even though you pointed this out before. The original reason I put the increase request into ResourceRequest is that, literally speaking, the increase request is another form of "resource request": it also asks for more resource; the only difference is that an increase request adds a restriction on the request.
          But in YARN's real implementation it's problematic to make it part of the resource request; I would need to handle increase cases everywhere in the RM. I think making it a new member of AllocateRequest is the cleaner solution, but potentially it will cause more interface/implementation changes (SchedulerApplication, YARNScheduler, etc.). I'll keep looking at it before starting to write code.

          I also agree with your comments on improving the representation of ChangeContainerResourceResponse and the missing ResourceIncreaseContextProto in AllocateResponse. I'll add my design proposal for handling the new resource in the monitoring module.

          Again, your comments are really helpful; I hope to get more of your ideas.

          Bikas Saha added a comment -

          Can we do with just change_succeeded and change_failed lists instead of 4 lists? Using the containerId, the AM can determine whether each change was an increase or a decrease.

          +message ChangeContainersResourceResponseProto {
          +  repeated ContainerIdProto succeed_increased_containers = 1;
          +  repeated ContainerIdProto succeed_decreased_containers = 2;
          +  repeated ContainerIdProto failed_increased_containers = 3;
          +  repeated ContainerIdProto failed_decreased_containers = 4;
          +}
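          A rough Java-side sketch of the two-list variant suggested above (the type and method names are hypothetical, not the actual protocol records): the AM keeps the sizes it asked for, keyed by container id, and recovers increase vs. decrease by comparison.

          import java.util.List;
          import java.util.Map;

          public class TwoListResponseSketch {
            /** Hypothetical response carrying only succeeded and failed change lists. */
            record ChangeResult(List<String> succeededContainerIds, List<String> failedContainerIds) {}

            /** The AM can tell whether a succeeded change was an increase by comparing sizes. */
            static boolean wasIncrease(String containerId,
                                       Map<String, Long> requestedMb,
                                       Map<String, Long> previousMb) {
              return requestedMb.get(containerId) > previousMb.get(containerId);
            }
          }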
          

          I don't think it's correct for ResourceRequest to be used to increase resources for an allocated container. I was expecting a new optional repeated field of type ResourceChangeContextProto in AllocateRequest. To request an increase in container C's resources, the AM would add a ResourceChangeContextProto for that container in the next AllocateRequest.

          In AllocateResponse, the type of increased container should be ResourceIncreaseContextProto, right? Without that the AM cannot get the new container token for that container.

          The NM changes also need to handle enforcing the new resource via cgroups etc in addition to changing the monitoring. This needs to be clarified in the document.

          Bikas Saha added a comment -

          Wangda, sorry for the delayed response. Was caught up with other work. I will take a look at the new proposal. Vinod Kumar Vavilapalli Can you please take a look at the latest proposal?

          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197-v3.pdf [ 12607745 ]
          Wangda Tan (No longer used) made changes -
          Assignee Wangda Tan [ gp.leftnoteasy ]
          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197-v2.pdf [ 12606770 ]
          Wangda Tan (No longer used) added a comment -

          Guys, I attached an updated doc based on our discussion, mainly focused on the workflow diagram and detailed API changes.
          Many thanks to Bikas Saha, Sandy Ryza and Alejandro Abdelnur. Hope to get your feedback. I'll start working on it.

          Alejandro Abdelnur added a comment -

          Bikas, yep, there is a race condition in the AM->RM->NM path for decreasing. At least in the FS, due to continuous scheduling (YARN-1010), the RM could allocate the freed space to an AM before the NM heartbeats and gets the info. This does not happen if allocations are tied to the corresponding NM heartbeating. Thanks.

          Sandy Ryza added a comment -

          To summarize what I wrote above: YARN is already asymmetrical wrt acquiring and releasing resources. I don't think the minimum allocation logic is enough to justify a round trip to the RM. It will require adding more new states that will make the whole thing more confusing and bug-prone. We can either push down this logic into the NodeManager or just handle it on the RM side, i.e. refuse to free any resources in the scheduler for a container that decreases from 1024 to 1023 mb.

          Bikas Saha added a comment -

          Sandy had some arguments on why this has race conditions w.r.t. when the RM can start allocating the freed-up resources. Can you please look at the comments above to check whether it's the same thing or not.

          Alejandro Abdelnur added a comment -

          Bikas, makes sense, thanks for summarizing.

          On decreasing: given that we also do a round trip AM->NM->RM->AM, why not make it a bit more symmetric:

          * AM asks the RM to decrease a container
          * RM notifies the NM about the container decrease on the next heartbeat

          With this approach the RM can enforce the minimum on an AM decrease and reject it if it falls below the minimum. Also, there is no need to notify the AM of the decrease taking place, as the AM requested it. And since it is a decrease, the AM can instruct the container to shrink even if the RM has not told the NM yet. Furthermore, I would expect an AM to instruct a container to shrink before asking YARN, to avoid a race condition that could kill the container for using more resources than it should.

          Also, by doing this there would be no difference in the free-resource bookkeeping between the RM and the NMs, which may be handy so as not to complicate things for YARN-311.

          Thoughts?

          Bikas Saha added a comment -

          Alejandro, we have already iterated through the "create a new container and merge it with the old container" approach and saw no advantage over a cleaner API that asks for an increase of an existing container's resources. Internal scheduler implementations can choose to model this as a new allocation followed by an internal merge, instead of exposing a new container to the external environment only to have it merged back by the AM. That is more burden on the AM and NM, and confusing to tracking services like logs/AHS. Also, the delta container request would have to specify exact locality to place it where the existing container is, or else the RM can allocate the delta anywhere. This should just happen automatically if the request is to increase an existing container's capacity.

          Given the above, here is the current approach based on all the comments above.

          To increase container resources
          1) The AM asks the RM to increase resources
          2) The RM increases the resources and provides a new token signing that increase
          3) The AM calls a new NM API to increase the resources, signed with the above token

          To decrease container resources
          1) The AM asks the NM to decrease the resources
          2) The NM tells the RM about this
          3) The RM informs the AM about the change and allocates the freed space somewhere else

          The upside of this is that it avoids race conditions. The downside is that the flow is different in the two cases. Another downside is that the NM does not know whether the reduced size will be unsupported by the scheduler. IMO it's not a good idea for RM scheduler internals to leak into the NM, and the NM should not get involved in those kinds of decisions. One simple solution would be that when the NM tells the RM about reduced resources, the RM only reduces its internal bookkeeping down to min-allocation. The extra freed resources would sit unused on the node, but they would be unused anyway if min-allocation were being enforced by the NM, so we are not worse off. As far as the AM is concerned, it already promised to stay within the reduced capacity, so this shouldn't affect it.
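          To make the two flows above concrete, here is a hedged AM-side sketch in Java. The RmClient/NmClient interfaces are hypothetical stand-ins for the AM-RM and AM-NM protocols, not real YARN client APIs.

          public class ResizeFlowsSketch {
            interface RmClient {
              /** Ask the RM to grow a container; returns a signed token once granted, else null. */
              String requestIncrease(String containerId, long newMemoryMb);
            }
            interface NmClient {
              /** Apply an RM-granted increase on the NM, presenting the new token. */
              void increaseContainer(String containerId, long newMemoryMb, String token);
              /** Shrink immediately on the NM; the NM reports it to the RM on its next heartbeat. */
              void decreaseContainer(String containerId, long newMemoryMb);
            }

            static void increase(RmClient rm, NmClient nm, String id, long mb) {
              String token = rm.requestIncrease(id, mb);   // 1) AM asks the RM
              if (token != null) {                         // 2) RM grants and signs a new token
                nm.increaseContainer(id, mb, token);       // 3) AM presents the token to the NM
              }
            }

            static void decrease(NmClient nm, String id, long mb) {
              nm.decreaseContainer(id, mb);                // 1) AM tells the NM; 2)-3) NM->RM->AM follow via heartbeats
            }
          }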

          Sandy Ryza added a comment -

          Bikas Saha, yeah, that's my suggestion.

          Alejandro Abdelnur added a comment - - edited

          Wangda Tan (No longer used), thanks for your previous answer, it makes sense.

          We thought about this a while ago in the context of Llama for Impala-YARN integration.

          Along the lines of what Sandy suggested, just a couple of extra comments.

          For decreasing, the AM can request the correction, effective immediately, to the NM. The NM then reports the container correction and the new free space to the RM in the next heartbeat. Regarding enforcing the minimum: configuration properties are scheduler-specific, so the minimum would have to come to the NM from the RM as part of the registration response.

          For increasing, the AM must go to the RM first to avoid the race conditions already mentioned. To reduce the changes in the RM to a minimum, I was thinking of the following approach:

          • The AM makes a regular new allocation request for the desired delta capability increase with relaxedLocality=false (no changes to the AM-RM protocol/logic).
          • The AM waits for the delta container allocation from the RM.
          • When the AM receives the delta container allocation, it updates the original container with the delta container using a new AM-NM API.
          • The NM makes the necessary corrections locally to the original container, adding the capabilities of the delta container.
          • The NM notifies the RM to merge the original container with the delta container.
          • The RM updates the original container and drops the delta container.

          The complete list of changes for this approach would be:

          • AM-NM API
            • decreaseContainer(ContainerId original, Resources)
            • increaseContainer(ContainerId original, ContainerId delta)
          • NM-RM API
            • decreaseContainer(ContainerId original, Resources)
            • registration() -> +minimumcontainersize
            • mergeContainers(ContainerId originalKeep, ContainerId deltaDiscard)
          • NM logic
            • needs to correct capabilities enforcement for +/- delta
          • RM logic
            • needs to update container resources when receiving a NM's decreaseContainer() call
            • needs to update original container resources and delete delta container resources when receiving a NM's mergeContainer() call
          • RM scheduler API
            • it should expose methods for decreaseContainer() and mergeContainers() functionality
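          The API list above could look roughly like the following hypothetical Java interface sketch; the method and type names mirror the list but are not actual YARN classes, and Resources is flattened to memory/vcores for brevity.

          public interface ContainerMergeApisSketch {
            // AM -> NM
            void decreaseContainer(String originalContainerId, long memoryMb, int vcores);
            void increaseContainer(String originalContainerId, String deltaContainerId);

            // NM -> RM (reported on heartbeat)
            void reportDecrease(String originalContainerId, long memoryMb, int vcores);
            void mergeContainers(String originalKeepId, String deltaDiscardId);

            // RM -> NM at registration: the scheduler minimum, so the NM can reject
            // decreases below it locally.
            long minimumAllocationMb();
          }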
          Bikas Saha added a comment -

          So the suggestion is that increase goes AM(request) -> RM(allocation) -> AM(increase) -> NM, and decrease goes AM(decrease) -> NM(inform) -> RM(consider free) -> AM (confirmation from the RM, similar to completedContainerStatus)?

          Sandy Ryza added a comment -

          I dont think Sandy meant that the AM first tells the NM to decrease the size and then the NM informs the RM.

          You're right about what I meant. Though thinking about this more, is there any reason a container shrinking needs to get permission from the RM? Should we not treat giving up part of a container the same way we treat giving up an entire container, i.e. the app unilaterally decides when to do it? If we need to respect properties like yarn.scheduler.minimum-allocation-mb, the NodeManagers could pick these up and enforce them by rejecting shrinkings.
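          A minimal sketch of that NM-side rejection, assuming the scheduler minimum is handed to the NM (e.g. at registration); the class and method names are hypothetical.

          public final class ShrinkValidationSketch {
            private final long minimumAllocationMb;

            public ShrinkValidationSketch(long minimumAllocationMb) {
              this.minimumAllocationMb = minimumAllocationMb;
            }

            /** Accept a decrease only if it actually shrinks the container and stays at or above the minimum. */
            public boolean acceptDecrease(long currentMb, long targetMb) {
              return targetMb < currentMb && targetMb >= minimumAllocationMb;
            }
          }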

          The downside is that for the duration of the heartbeat interval, the node may get overbooked but that should not be a problem in practice since the container would already be using a lower value of resources before the AM asked its capacity to be decreased.

          Accepting overbooking in this context seems to me like it would open up a bunch of race conditions and compromise a bunch of useful assumptions an administrator can make about what's running on a node at a given time. Do the uses of container shrinking require such low latency? (which we would also achieve by avoiding the round trip to the RM)

          Wangda Tan (No longer used) added a comment -

          For decreasing resources, if the RM is to consider the free resource available only after the AM informs the NM and the NM heartbeats with the RM then this change may become more complicated since the current schedulers dont expect any lag in their allocations. This will also delay the allocation of the free space to others. Also this delay is determined by when the AM syncs with the NM. Thats not a good property. We should probably assume the decrease to be effective immediately and RM-NM sync should enforce that. The downside is that for the duration of the heartbeat interval, the node may get overbooked but that should not be a problem in practice since the container would already be using a lower value of resources before the AM asked its capacity to be decreased.

          I think it makes sense. Having the AM tell the NM first means the RM cannot leverage the freed resources, which is not good for a heavily loaded cluster. I'll update the document per our discussion and start breaking down tasks. Please let me know if you have any other comments.

          Bikas Saha added a comment -

          I don't think Sandy meant that the AM first tells the NM to decrease the size and then the NM informs the RM. He meant the AM asks the RM. The RM decreases/increases the size and then the AM informs the NM about the change; RM-NM communication happens via heartbeat, which may occur after some time.

          For decreasing resources, if the RM is to consider the free resource available only after the AM informs the NM and the NM heartbeats with the RM then this change may become more complicated since the current schedulers dont expect any lag in their allocations. This will also delay the allocation of the free space to others. Also this delay is determined by when the AM syncs with the NM. Thats not a good property. We should probably assume the decrease to be effective immediately and RM-NM sync should enforce that. The downside is that for the duration of the heartbeat interval, the node may get overbooked but that should not be a problem in practice since the container would already be using a lower value of resources before the AM asked its capacity to be decreased.
          The same problem does not hold for increasing resources.

          Wangda Tan (No longer used) added a comment -

          The method mentioned by Sandy Ryza can solve the "cheating" problem on the AM side. The AM doesn't need to tell the RM to decrease the container size at all, just the NM, and the NM tells the RM via heartbeat; the only issue is a latency on the order of seconds before the decrease shows up on the scheduler side, which shouldn't be a big problem.
          As for the race problem mentioned by Bikas Saha, I think it is acceptable: when an AM requests more resources for a container, it never knows when the RM will grant them. So the AM may need to launch the container with the smaller resources, which is not harmful to either the scheduler or the NM (using less resource is not a problem). After some time, the AM will get the allocated resources, and it can tell the NM and the child process to increase the memory quota. Do you agree with this?

          Sandy Ryza added a comment -

          It seems to me that for the reasons Bikas mentioned and for consistency with the way container launch is done, the AM should be the one who tells the NM to do the resize. If resources are released, then the NM would tell the RM about the newly free space on its next heartbeat after the resize has completed. Only then would the scheduler consider those resources available.

          Wangda Tan (No longer used) added a comment -

          To Bikas Saha,
          Thanks for your comments, see my opinions below,

          Still thinking through the RM-NM interactions. The request for change should probably be a new object that is basically a map of (containerId, Resource) where Resource is new value for the existing containerId. Not quite sure how we would use the new container token for a running container since its only used in start container.

          Agreed; we need to update the YarnScheduler.allocate interface to accept this as a parameter if we make the change request independent.
          And as you mentioned below, we can use the new token to update the NM's resource monitoring limits for containers.

          If we wait for RM to sync with NM about the increased resources then it might be too slow since this happens on a heartbeat and the heartbeat interval can be in the order of seconds. An alternative would be a new NM API to allow AM's to increase resources and this would be signed with new container token. But this would burden the AMs by requiring them to make that additional call.

          Agreed; this is much more time-effective than RM-NM communication. Yes, changing a container size has a cost for both the AM and the NM, but the AM should be disciplined enough not to do this too frequently.

          There could be a race between a new container token coming in with increased resources for an acquired container and the old container token being used by the NMClient to launch the container (in case the AM decides to launch the smaller container while it was waiting for an increase).

          Hmmm... thanks for the reminder; this really is a problem. Another issue I see is that the AM may "lie" to the RM/NM about resource usage. The AM can:
          1) allocate a big container and launch it
          2) ask to decrease the container, so the RM releases the resource for the corresponding node/application
          3) never tell the NM about this decrease, so the container can still use the resource that was released

          I don't have a good idea for solving this problem yet. Hope to get more ideas from you about this; I will think it through as well.

          Wangda Tan (No longer used) added a comment -

          To Alejandro Abdelnur,
          I think changing the heap size (Xmx, etc.) of a running JVM-based container is not really part of this topic. If a user wants to change a JVM-based container's size, he/she can use a "watcher" process that launches the "worker" process in the container and relaunches the "worker" with different JVM parameters if needed.
          In short, if we cannot solve this on the language side, we can solve it on the application side.

          Bikas Saha added a comment -

          Still thinking through the RM-NM interactions. The request for change should probably be a new object that is basically a map of (containerId, Resource) where Resource is new value for the existing containerId. Not quite sure how we would use the new container token for a running container since its only used in start container.
          If we wait for RM to sync with NM about the increased resources then it might be too slow since this happens on a heartbeat and the heartbeat interval can be in the order of seconds. An alternative would be a new NM API to allow AM's to increase resources and this would be signed with new container token. But this would burden the AMs by requiring them to make that additional call.
          There could be a race between a new container token coming in with increased resources for an acquired container and the old container token being used by the NMClient to launch the container (in case the AM decides to launch the smaller container while it was waiting for an increase).

          Alejandro Abdelnur added a comment -

          for my own education, how are you planning to reduce/increase the heap size of a running JVM-based container?

          Bikas Saha added a comment -

          Sorry for the delay. I will try to get to this soon.

          Wangda Tan (No longer used) made changes -
          Attachment yarn-1197.pdf [ 12604489 ]
          Wangda Tan (No longer used) added a comment -

          Added an initial proposal for this, including increasing/decreasing an acquired or running container; I hope somebody can help me review it. Then we can move forward to break down tasks and start working on it. Thanks.

          Wangda Tan (No longer used) added a comment -

          I totally agree with you. I'll work out a plan that considers increasing/decreasing an available container (allocated/running) with RM-AM-NM communication, and will keep you posted. Thanks.

          Bikas Saha made changes -
          Assignee Wangda Tan [ gp.leftnoteasy ]
          Bikas Saha made changes -
          Assignee Tan, Wangda [ wangda ] Wangda Tan [ gp.leftnoteasy ]
          Bikas Saha made changes -
          Assignee Tan, Wangda [ wangda ]
          Bikas Saha made changes -
          Summary Support increasing resources of an allocated container Support changing resources of an allocated container
          Bikas Saha added a comment -

          I am not quite sure what Exception you mention above. Reservation is the mechanism used by the scheduler to accumulate resources on a machine till a container request can be satisfied. If a different machine becomes free in the meanwhile then the container should be allocated to that machine and the reservation removed from the current machine.
          Hence, if we try to allocate more resources to an acquired container, then it will essentially go through the above cycle again. Thus we should have simply requested the larger container in the first place.
          In your use case, if you know you want a container of size X then you should ask for X. If you don't know exactly what size you want then you can ask for a larger container than needed. Later, while running the container, if you realize that the container can be smaller then you can reduce its size. If you realize that the container needs to be even larger then you can increase its size. It will be much easier to reduce the size of a running container than to increase it, since we are giving up resources to the RM and that is a straightforward operation. Increasing resources is harder because we have to wait for resources to free up. In both cases, we will need to improve RM-AM-NM communication to inform all parties that the container resources have changed.
          I am changing the title of this jira to dynamically change the resources of an allocated container since the same changes are needed to decrease and increase. Decrease is simpler than increase.

          Wangda Tan (No longer used) added a comment -

          Increasing resources for a container while in acquired state is not different from waiting for some more time on the RM and allocating the larger container in the first attempt, right?

          I think there's a small difference here: when waiting for resources for a big container in the first attempt, the scheduler puts the request into "reservedContainer" on FSSchedulableNode or FiCaSchedulerNode. This is treated as an exception; the RM will try to satisfy such a reserved container first when many different requests exist at the same time on the same node.
          But if we ask for more resources for an acquired container, I don't know your preference: do you want to create another "exception" that puts an "acquired container" into *SchedulableNode so it gets processed with priority, or simply treat the request as a normal resource request?

          Also, the RM starts a timer for each acquired container and expects the container to be launched on the NM before the timer expires. So we dont have too much time for the container to be launched and thus we cannot wait for increasing the resources.

          Could we refresh (receivePing) the timer for a container when we successfully increase its resources?

          To be useful, we have to be able to increase the resources of a running container. I agree that its a significant change. So making the change will need a more thorough investigation and clear design proposal.

          Agreed! I'd like to help move this forward. I need to investigate and consider end-to-end cases and draft a design proposal for it; once I have some ideas or questions, I will let you know.

          Thanks

          Bikas Saha added a comment -

          Increasing resources for a container while in acquired state is not different from waiting for some more time on the RM and allocating the larger container in the first attempt, right? Also, the RM starts a timer for each acquired container and expects the container to be launched on the NM before the timer expires. So we dont have too much time for the container to be launched and thus we cannot wait for increasing the resources.
          To be useful, we have to be able to increase the resources of a running container. I agree that its a significant change. So making the change will need a more thorough investigation and clear design proposal. Your help in making this happen is most welcome!

          Wangda Tan (No longer used) added a comment -

          Here is the tentative plan:
          To make it simple, we can start by only supporting container resource increases when the container state is "ACQUIRED", because when the container is running we need to find a way to inform the NM to update the quota of a running process, which will affect many parts.

          1. Add a field in ResourceRequest, like set/getExistingContainerId(), to indicate that the resource will be added to an existing container
          2. When the ask list is submitted to YarnScheduler, it will be normalized:
            • When existingContainerId is set
              • Check whether the container with "existingContainerId" is in state ACQUIRED; abandon the request if the container does not exist OR its state is not ACQUIRED.
              • Check whether this container's capability + the asked capability exceeds scheduler.maximum-allocation-mb; if so, normalize the asked capability
              • All fields other than "capability" are ignored, and the "priority/host" use the existing container's priority/host
              • Update resource requests in AppSchedulingInfo
          3. Add "increase existing container resource" support in LeafQueue of the capacity/fifo schedulers, and FSLeafQueue in the fair scheduler. We can treat an "increase existing container resource request" the same as other resource requests except:
            • Only "node-local" is available
            • Don't create a new container
          4. After we successfully allocate resources to an existing container, we need to refresh its container token and send it back to the AM. The AM can use the new container token to launch processes via the NM. (A rough sketch of steps 1-2 follows below.)
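          As a hedged illustration of steps 1-2 above, here is a minimal Java sketch of an ask optionally tagged with an existing container id, plus the normalization check. All names are illustrative; getExistingContainerId() is the proposed field, not an existing API, and capability is flattened to memory MB.

          public class IncreaseAskSketch {
            record Ask(String existingContainerId /* null for a normal ask */,
                       long capabilityMb, String host, int priority) {}

            /** Normalize an increase ask against the current container size and the scheduler maximum. */
            static Ask normalize(Ask ask, long currentMb, boolean acquired, long maxAllocationMb) {
              if (ask.existingContainerId() == null) {
                return ask;                                 // ordinary new-container request
              }
              if (!acquired) {
                return null;                                // abandon: container missing or not ACQUIRED
              }
              // cap the asked delta so current + delta stays within the scheduler maximum
              long capped = Math.min(ask.capabilityMb(), maxAllocationMb - currentMb);
              // priority/host come from the existing container; only capability is honored
              return new Ask(ask.existingContainerId(), capped, ask.host(), ask.priority());
            }
          }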

          Any thoughts? I'd like to help move this ahead if we can reach agreement on a rough idea.

          Bikas Saha made changes -
          Summary Add container merge support in YARN Support increasing resources of an allocated container
          Bikas Saha made changes -
          Link This issue is depended upon by YARN-896 [ YARN-896 ]
          Wangda Tan added a comment -

          Bikas,
          Thanks for your patient explanations. I just took a look at YARN-898, and it really helps.
          I agree the RM can do this much more easily than the AM. And as you pointed out, in our case increasing an existing container's size is the same as merging containers. I support changing this title to something like "increase an existing container size" and linking this to YARN-898.

          Bikas Saha added a comment -

          YARN allocates containers and manages them. I don't think it would be possible to change YARN to allow the app to get N containers, merge them, and then launch just 1 container.
          Again, the app could accumulate small containers on a node and then merge them or the RM could accumulate resources on a node and assign a container to the app. The end result is the same but IMO the chances of a converging solution are much higher if the RM does it and not the app. (because of point 2 in your comments about small requests)
          For YARN, increasing a running container size would be almost the same as assigning another container on the same node so that the new container may be merged with the running container. So a better feature request is to allow increase in container capacity for an allocated container.

          I would suggest that you look at the long-running services JIRA, YARN-896. Please check whether the items there cover your scenarios. If not, then please consider whether your feature request should actually be to allow increasing the resources of an allocated container. If you agree, then we can change the title of this JIRA to reflect that and link it to YARN-896.

          Wangda Tan added a comment -

          Hi Bikas,
          Thanks for the reply; it helps me understand the YARN mechanism, but I think there are some misunderstandings.

          In some HPC cases, how many processes will be launched on each node is not determined before we submit the job; we just give it enough total resource (like 100G) in the cluster. So we have the following problems:
          1) We will launch exactly one daemon process on each node, and this daemon process launches the other local processes. This is the root cause of why we want this feature.
          2) We don't know how much resource to request in this case:

          1. Large requests may cause some waste, and they are hard to get from the RM.
          2. Small requests may not be enough (when the cluster is busy, we cannot "regret": if we already have a small amount of space on a node, we can only return it and ask for a larger one, but once we have returned it, that space may be occupied by another app and we cannot take it back).

          With such an API, we could implement our AM more easily: we could iteratively send requests to the RM based on what we already have, and finally merge the allocations into big containers and hand them to the real app (like PBS/TORQUE/MPI). We could make a "small cluster" in YARN and support HPC workloads very well. (It's a little similar to Mesos, which aggregates resources to a slave daemon that then manages them, but we don't need to make it dynamic, i.e., increase a container's size while it is running; merging before we start processes would be good enough.)

          Bikas Saha added a comment -

          1) We can incrementally send resource request with small resources like before, until we get enough resources in total

          2) Merge resource in the same node, make only one big container in each node

          When the RM is asked for a container, this is exactly what the RM does: it incrementally adds reserved space on a node until it can allocate the full resources desired by the container, and then it assigns the container to the app. So it's not clear how making small allocations and then merging them in the app is going to help.
          By asking the RM directly for 10G resources we can ensure that the RM will eventually give us that. If we ask for 10 1G resources then we are not guaranteed that the RM will give them to us on the same node and thus the overall request may be unsatisfiable.
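
          (Illustrative aside, not part of the comment above: a minimal AM-side sketch of making the single large ask directly, using the standard AMRMClient API. Registration with the RM and the allocate heartbeat loop are omitted, and the class name is hypothetical.)

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.yarn.api.records.Priority;
              import org.apache.hadoop.yarn.api.records.Resource;
              import org.apache.hadoop.yarn.client.api.AMRMClient;
              import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

              public class BigAskExample {
                public static void main(String[] args) {
                  // A single 10G ask lets the RM do the incremental reservation on
                  // one node and hand back a container only once the full amount
                  // is available there.
                  AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
                  amrmClient.init(new Configuration());
                  amrmClient.start();

                  ContainerRequest bigAsk = new ContainerRequest(
                      Resource.newInstance(10 * 1024, 1), // 10 GB, 1 vcore
                      null, null,                          // no node/rack constraint
                      Priority.newInstance(0));
                  amrmClient.addContainerRequest(bigAsk);

                  // By contrast, ten independent 1G asks may each be satisfied on a
                  // different node, so they cannot simply be combined afterwards.
                }
              }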

          Bikas Saha added a comment -

          (Copying description into comments to reduce email size.)

          Some applications (like OpenMPI) have their own daemon on each node (one per node) in their original implementation, and their users' processes are directly launched by the local daemon (like the task-tracker in MRv1, but per-application). Many functionalities depend on the pipes created when a process is forked by its parent, such as IO forwarding and process monitoring (which does more than what the NM does for us), and this may cause some scalability issues.

          A very common resource request in the MPI world is "give me 100G of memory in the cluster; I will launch 100 processes within this resource". In current YARN, we have the following two choices to make this happen:
          1) Send 1G allocation requests iteratively until we have 100G in total, then ask the NMs to launch the 100 MPI processes. That causes the problems mentioned above: no support for IO forwarding, process monitoring, etc.
          2) Send a larger resource request, like 10G. But we may encounter the following problems:
          2.1 Such a large resource request is hard to get at one time.
          2.2 We cannot use more resources on a node than the amount we specified (we can only launch one daemon per node).
          2.3 It is hard to decide how much resource to ask for.

          So my proposal is,
          1) We can incrementally send resource request with small resources like before, until we get enough resources in total
          2) Merge resource in the same node, make only one big container in each node
          3) Launch daemons in each node, and the daemon will spawn its local processes and manage them.

          For example,
          We need to run 10 processes, 1G for each, finally we got
          container 1, 2, 3, 4, 5 in node1.
          container 6, 7, 8 in node2.
          container 9, 10 in node3.
          Then we will,
          merge [1, 2, 3, 4, 5] to container_11 with 5G, launch a daemon, and the daemon will launch 5 processes
          merge [6, 7, 8] to container_12 with 3G, launch a daemon, and the daemon will launch 3 processes
          merge [9, 10] to container_13 with 2G, launch a daemon, and the daemon will launch 2 processes
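
          (Illustrative aside, not part of the comment above: YARN has no merge API today, but the per-node bookkeeping an AM would need before any merge, or before a size-increase request, is small. The sketch below groups allocated containers by node and totals their capabilities, matching the node1/node2/node3 example; the class and method names are hypothetical.)

              import java.util.HashMap;
              import java.util.List;
              import java.util.Map;

              import org.apache.hadoop.yarn.api.records.Container;
              import org.apache.hadoop.yarn.api.records.NodeId;
              import org.apache.hadoop.yarn.api.records.Resource;
              import org.apache.hadoop.yarn.util.resource.Resources;

              public class PerNodeGrouping {
                // Sum the capabilities of the allocated containers on each node,
                // e.g. {node1=5G, node2=3G, node3=2G} for the example above.
                public static Map<NodeId, Resource> totalPerNode(List<Container> allocated) {
                  Map<NodeId, Resource> perNode = new HashMap<>();
                  for (Container c : allocated) {
                    Resource soFar = perNode.get(c.getNodeId());
                    perNode.put(c.getNodeId(), soFar == null
                        ? Resources.clone(c.getResource())
                        : Resources.add(soFar, c.getResource()));
                  }
                  return perNode;
                }
              }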

          Bikas Saha made changes -
          Field: Description
          Original Value:
          Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is,

          In some applications (like OpenMPI) has their own daemons in each node (one for each node) in their original implementation, and their user's processes are directly launched by its local daemon (like task-tracker in MRv1, but it's per-application). Many functionalities are depended on the pipes created when a process forked by its father, like IO-forwarding, process monitoring (it will do more logic than what NM did for us) and may cause some scalability issues.

          A very common resource request in MPI world is, "give me 100G memory in the cluster, I will launch 100 processes in this resource". In current YARN, we have following two choices to make this happen,
          1) Send allocation request with 1G memory iteratively, until we got 100G memories in total. Then ask NM launch such 100 MPI processes. That will cause some problems like cannot support IO-forwarding, processes monitoring, etc. as mentioned above.
          2) Send a larger resource request, like 10G. But we may encounter following problems,
             2.1 Such a large resource request is hard to get at one time.
             2.2 We cannot use other resources more than the number we specified in the node (we can only launch one daemon in one node).
             2.3 Hard to decide how much resource to ask.

          So my proposal is,
          1) We can incrementally send resource request with small resources like before, until we get enough resources in total
          2) Merge resource in the same node, make only one big container in each node
          3) Launch daemons in each node, and the daemon will spawn its local processes and manage them.

          For example,
          We need to run 10 processes, 1G for each, finally we got
          container 1, 2, 3, 4, 5 in node1.
          container 6, 7, 8 in node2.
          container 9, 10 in node3.
          Then we will,
          merge [1, 2, 3, 4, 5] to container_11 with 5G, launch a daemon, and the daemon will launch 5 processes
          merge [6, 7, 8] to container_12 with 3G, launch a daemon, and the daemon will launch 3 processes
          merge [9, 10] to container_13 with 2G, launch a daemon, and the daemon will launch 2 processes
          New Value:
          Currently, YARN cannot support merge several containers in one node to a big container, which can make us incrementally ask resources, merge them to a bigger one, and launch our processes. The user scenario is described in the comments.
          Wangda Tan added a comment -

          I don't know whether it is possible to add this on the RM or NM side.
          I think it would be easier to move some existing applications (OpenMPI, PBS, etc.) to the YARN platform, because such applications already have their own daemons in their old implementations, and container merge can help them leverage their original logic with fewer modifications to become residents of YARN.
          Your suggestions and comments are welcome!

          Thanks,
          Wangda

          Wangda Tan created issue -

            People

            • Assignee: Unassigned
            • Reporter: Wangda Tan
            • Votes: 12
            • Watchers: 84
