Hadoop YARN / YARN-1404

Enable external systems/frameworks to share resources with Hadoop leveraging Yarn resource scheduling

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

      Description

      Currently, Hadoop Yarn expects to manage the lifecycle of the processes its applications run their workload in. External frameworks/systems could benefit from sharing resources with other Yarn applications while running their workload within long-running processes owned by the external framework (in other words, running their workload outside of the context of a Yarn container process).

      Because Yarn provides robust and scalable resource management, it is desirable for some external systems to leverage the resource governance capabilities of Yarn (queues, capacities, scheduling, access control) while supplying their own resource enforcement.

      Impala is an example of such a system. Impala uses Llama (http://cloudera.github.io/llama/) to request resources from Yarn.

      Impala runs an impalad process on every node of the cluster. When a user submits a query, the processing is broken into 'query fragments' which are run in multiple impalad processes leveraging data locality (similar to Map-Reduce Mappers processing a co-located HDFS block of input data).

      The execution of a 'query fragment' requires an amount of CPU and memory in the impalad. The impalad shares its host with other services (HDFS DataNode, Yarn NodeManager, HBase RegionServer) and with Yarn applications (MapReduce tasks).

      To ensure that cluster utilization follows the Yarn scheduler policies and does not overload the cluster nodes, before running a 'query fragment' on a node, Impala requests the required amount of CPU and memory from Yarn. Once the requested CPU and memory have been allocated, Impala starts running the 'query fragment', taking care that the 'query fragment' does not use more resources than the ones that have been allocated. Memory is bookkept per 'query fragment', and the threads used for processing the 'query fragment' are placed under a cgroup to contain CPU utilization.

      Today, for all resources that have been requested from the Yarn RM, a (container) process must be started via the corresponding NodeManager. Failing to do this results in the cancellation of the container allocation, relinquishing the acquired resource capacity back to the pool of available resources. To avoid this, Impala starts a dummy container process doing 'sleep 10y'.
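
      For illustration only (not part of this JIRA or its patch), a minimal sketch of what this dummy-container workaround looks like from an application master's point of view, using the existing NMClient API; the class and method names are invented, and the Container object is assumed to come from a prior AMRMClient allocation:

      import java.nio.ByteBuffer;
      import java.util.Arrays;
      import java.util.HashMap;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
      import org.apache.hadoop.yarn.api.records.Container;
      import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
      import org.apache.hadoop.yarn.api.records.LocalResource;
      import org.apache.hadoop.yarn.client.api.NMClient;

      public class DummyContainerLauncher {

        // Starts the placeholder process in an allocated container so the RM
        // does not reclaim the allocation while Impala uses the resources
        // out of band.
        public static void launchDummy(Configuration conf, Container container)
            throws Exception {
          NMClient nmClient = NMClient.createNMClient();
          nmClient.init(conf);
          nmClient.start();

          ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
              new HashMap<String, LocalResource>(),          // no local resources
              new HashMap<String, String>(),                 // no environment
              Arrays.asList("sleep 10y"),                    // placeholder command
              new HashMap<String, ByteBuffer>(),             // no service data
              null,                                          // no tokens
              new HashMap<ApplicationAccessType, String>()); // no ACLs

          nmClient.startContainer(container, ctx);
        }
      }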

      Using a dummy container process has its drawbacks:

      • the dummy container process is in a cgroup with a given number of CPU shares that are not used, and Impala is re-issuing those CPU shares to another cgroup for the threads running the 'query fragment'. The cgroup CPU enforcement works correctly because of the CPU controller implementation (but the formally specified behavior is actually undefined).
      • Impala may ask for CPU and memory independently of each other. Some requests may be memory-only with no CPU, or vice versa. Because a container requires a process, the complete absence of memory or CPU is not possible even if the dummy process is 'sleep'; a minimal amount of memory and CPU is required for the dummy process.

      Because of this it is desirable to be able to have a container without a backing process.

      Attachments

      1. YARN-1404.patch (31 kB, Alejandro Abdelnur)

        Issue Links

          Activity

          Alejandro Abdelnur added a comment -

          The idea for unmanaged containers is not to modify the lifecycle of the request/allocation/activation of container requests. This means that an unmanaged container follows the exact same path as regular containers, with the sole exception that no process is started for it.

          The ContainerLaunchContext would have an UNMANAGED_CONTAINER constant which is an 'empty' ContainerLaunchContext instance (no environment, no command, no local resources, etc.).

          If the UNMANAGED_CONTAINER constant is used as the ContainerLaunchContext in a StartContainerRequest when doing a ContainerManagementProtocol#startContainers(...) call, then the NodeManager would not start the container process.

          In the NodeManager, there is one ContainerLaunch instance per running container, which blocks in its call() method while the container is running, and whose cleanUp() method ends the container. For unmanaged containers these 2 methods would simply use a latch instead of starting/blocking on/stopping a process. By doing this, unmanaged containers will also block just like regular containers.

          In addition, the ContainersMonitorImpl must ignore unmanaged containers, as there is no underlying process tree to monitor. This can be done by adding a ContainersMonitorEvent.isUnmanagedContainer() method that indicates there is no underlying process to monitor.
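
          A minimal sketch of the AM side of this proposal; all of the types and calls below already exist in the Hadoop 2.2 client API, while UNMANAGED_CONTAINER is the proposed addition, so the sketch builds an equivalent 'empty' context by hand (the class and method names are invented for illustration):

          import java.nio.ByteBuffer;
          import java.util.Collections;
          import java.util.HashMap;

          import org.apache.hadoop.yarn.api.ContainerManagementProtocol;
          import org.apache.hadoop.yarn.api.protocolrecords.StartContainerRequest;
          import org.apache.hadoop.yarn.api.protocolrecords.StartContainersRequest;
          import org.apache.hadoop.yarn.api.protocolrecords.StartContainersResponse;
          import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
          import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
          import org.apache.hadoop.yarn.api.records.LocalResource;
          import org.apache.hadoop.yarn.api.records.Token;

          public class UnmanagedContainerStarter {

            public static StartContainersResponse startUnmanaged(
                ContainerManagementProtocol cm, Token containerToken) throws Exception {
              // 'Empty' launch context: no local resources, environment, command,
              // service data or ACLs. The proposal would expose a shared
              // UNMANAGED_CONTAINER constant equivalent to this.
              ContainerLaunchContext empty = ContainerLaunchContext.newInstance(
                  new HashMap<String, LocalResource>(),
                  new HashMap<String, String>(),
                  Collections.<String>emptyList(),
                  new HashMap<String, ByteBuffer>(),
                  null,
                  new HashMap<ApplicationAccessType, String>());

              StartContainerRequest request =
                  StartContainerRequest.newInstance(empty, containerToken);

              // Under the proposal, the NM would register this container but
              // never launch a process for it.
              return cm.startContainers(
                  StartContainersRequest.newInstance(Collections.singletonList(request)));
            }
          }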

          Alejandro Abdelnur added a comment -

          Initial patch showing the described approach.

          Hitesh Shah added a comment -

          Alejandro Abdelnur How is scheduling management/enforcement (preemption, etc.) meant to work with unmanaged containers? Is an unmanaged container an actual process that is running on the NM but not controlled by the NM? If yes, how would it be killed if the container is preempted by the RM?

          Unless I am mistaken, at this point, it seems like 2 features are needed: container leases and/or NM resource resizing.

          Hitesh Shah added a comment -

          Alejandro Abdelnur Looks like no process is launched. I believe this should be solved by adding support for container leases and not by introducing flags into the container launch context.

          Alejandro Abdelnur added a comment -

          Hitesh Shah,

          How is scheduling management/enforcement (preemption, etc) meant to work with unmanaged containers?

          The AM that started the unmanaged container gets the early-preemption/preemption/lost notification from the RM and notifies the out-of-band process on the corresponding node to release the corresponding resources. (Impala/Llama is doing this today with the dummy sleep containers.)

          A NM plugin notifies the collocated out-of-band process that the unmanaged container has ended. This prompts the out-of-band process to release the corresponding resources. (We are working on getting this into Impala/Llama.)

          In theory, the former is sufficient. In practice, having the latter as well drives a faster reaction to preemption/loss of resources.

          it seems like 2 features are needed: ... NM resource resizing.

          IMO, NM resource resizing is orthogonal to unmanaged resources.

          it seems like 2 features are needed: container leases ...

          In the current proposal the container leases are out of band; they happen between the process using the resources out of band (i.e. Impala) and the AM (i.e. Llama).

          The reason I've taken the approach of leaving the container leases out of band is:

          • To keep a single lifecycle for containers instead of two different lifecycles. This keeps the current state transitions intact, reducing the chances of introducing errors there now or when the lifecycle evolves.
          • A lease would require an additional call to renew the lease. This would require introducing lease tokens, as the lease renewal could be done by an out-of-band system.
          • If the RM is the recipient of lease renewals, we are adding additional responsibilities to the RM and making it handle additional clients, potentially several new ones (the out-of-band processes).
          • If the NM is the recipient of the lease, we still need a flag when launching the container to indicate to the NM that the container is unmanaged and leases will be coming in.

          IMO, I don't think we gain much by having Yarn manage leases for unmanaged containers, as it is still in the hands of the out-of-band process using the container resources to effectively release the resources when asked to.

          Thoughts?

          Bikas Saha added a comment -

          Have you looked at YARN-1040? It envisages delinking the container lifecycle from the process lifecycle. So a container may be associated with 0 processes and can run a succession of processes. Apps in essence get a chunk of resources on that machine and can choose to run processes whenever they want on that machine. IMO, YARN-1040 subsumes this jira and is conceptually more generic.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12613372/YARN-1404.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          -1 javac. The applied patch generated 1549 javac compiler warnings (more than the trunk's current 1544 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

          org.apache.hadoop.yarn.client.api.impl.TestNMClient

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2422//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/2422//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2422//console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          Hitesh Shah, we actually don't need the UNMANAGED_CONTAINER constant; we can just specify NULL, which would mean no process is associated with the container.

          Bikas Saha, I'm following YARN-1040 but it didn't strike me as related, especially as it cares about >1 processes. But the way you position it, it can be seen as related.

          So, starting a container with a NULL ContainerLaunchContext would start the container lifecycle and prevent it from being timed out and reclaimed by the RM.

          Then this JIRA could be seen as a subtask of YARN-1040 that enables the zero-process use case, to be followed up by the reuse use case. Do you agree?

          Alejandro Abdelnur added a comment -

          I've started working on a patch that takes NULL as 'no process' instead of the constant, but NPEs pop up all over the code as I keep guarding against them. I'm inclined to either keep the constant in the public API and rename it to NO_PROCESS, or, if we want to have NULL in the API, have the NM substitute a private NO_PROCESS constant when NULL arrives.

          Steve Loughran added a comment -
          1. I'd be inclined to treat this as a special case of YARN-1040, the "0 process state". If both can be addressed in the same code path, that's one less code path to look after.
          2. It's dangerously easy to leak containers here; I know llama keeps an eye on things, but I worry about other people's code - though admittedly, any long-lived command-line app ("yes") could do the same.

          For the multi-process case (and that includes processes=0), we really do need some kind of lease renewal option to stop containers from being retained forever. It would become the job of the AM to do the renewal.

          Vinod Kumar Vavilapalli added a comment -

          -1 for this. (my first on any JIRA).

          As I repeated on other JIRAs, please change the title with the problem statement instead of solutions.

          Currently a container allocation requires to start a container process with the corresponding NodeManager's node.

          For applications that need to use the allocated resources out of band from Yarn this means that a dummy container process must be started.

          I indicated this offline about llama with others. I don't think you need NodeManagers either to do what you want; forget about containers. All you need is to use the ResourceManager/scheduler in isolation using MockRM/LightWeightRM (YARN-1385) - your need seems to be to use the scheduling logic in YARN and not the physical resources.

          Sandy Ryza added a comment -

          Vinod Kumar Vavilapalli, a lightweight RM is not sufficient because the goal of llama is to be able to run frameworks that use unmanaged containers alongside frameworks that don't. While Impala does its own resource enforcement, it wants to coexist on a YARN instance with MR and other frameworks that fit more naturally with the YARN model.

          Are you saying YARN should never support containers that don't launch a process? Is there anything gained by this?

          Alejandro Abdelnur added a comment -

          Steve Loughran

          1. I'd be inclined to treat this as a special case of YARN-1040

          I've just commented in YARN-1040 following Bikas' comment on this https://issues.apache.org/jira/browse/YARN-1040?focusedCommentId=13821597&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13821597

          It's dangerously easy to leak containers here; I know llama keeps an eye on things, but I worry about other people's code -though admittedly, any long-lived command line app "yes" could do the same.

          We can have NM configs to disable no-process or multi-process containers, but you can still work around this by having a dummy process. This is how Llama is doing things today, but it is not ideal for several reasons.

          IMO, from the Yarn perspective we need to allow AMs to do sophisticated things within the Yarn programming model (like you are trying to do with long-lived containers or what I'm doing with Llama).

          For the multi-process (and that includes processes=0), we really do need some kind of lease renewal option to stop containers being retained forever. It would become the job of the AM to do the renewal

          As I've mentioned above, I don't think we need a special lease for this: https://issues.apache.org/jira/browse/YARN-1404?focusedCommentId=13820200&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13820200 (look for 'The reason I've taken the approach of leaving the container leases out of band is:')

          Vinod Kumar Vavilapalli

          -1 for this...

          I think you are jumping too fast here.

          As I repeated on other JIRAs, please change the title with the problem statement instead of solutions.

          IMO that makes complete sense for bugs; for improvements/new-features a description of the change communicates more, as it will be the commit message. The shortcomings the JIRA is trying to address should be captured in the description.

          Take for example the following JIRA summaries, would you change them to describe a problem?

          • AHS should support application-acls and queue-acls
          • AM's tracking URL should be a URL instead of a string
          • Add support for zipping/unzipping logs while in transit for the NM logs web-service
          • YARN should have a ClusterId/ServiceId

          I indicated offline about llama with others. I don't think you need NodeManagers either to do what you want, forget about containers. All you need is use the ResourceManager/scheduler in isolation using MockRM/LightWeightRM (YARN-1385) - your need seems to be using the scheduling logic in YARN and not use the physical resources.

          The whole point of Llama is to allow Impala to share resources in a real Yarn cluster running other workloads like Map-Reduce. In other words, Impala/Llama and other AMs must share cluster resources.

          Vinod Kumar Vavilapalli added a comment -

          Vinod Kumar Vavilapalli, a lightweight RM is not sufficient because the goal of llama is to be able to run frameworks that use unmanaged containers alongside frameworks that don't. While Impala does its own resource enforcement, it wants to coexist on a YARN instance with MR and other frameworks that fit more naturally with the YARN model.

          Well, this has been my problem, and I'm sure others will agree. Proposing unmanaged containers before explaining your key requirements keeps folks who are only looking at the JIRA in the dark.

          Are you saying YARN should never support containers that don't launch a process? Is there anything gained by this?

          If that need arises, and if there are no other first-class solutions, then yes. Otherwise no.

          I think you are jumping too fast here

          That's because I see multiple JIRAs all trying to achieve a common goal, and instead of discussing that design, we are shoe-horned into debating individual tickets that don't make up the overall goal.

          IMO that makes completely sense for bugs, for improvements/new-features a description of it communicates more as it will be the commit message. The shortcomings the JIRA is trying to address should be captured in the description.

          Agree that it is subjective. But for some of the tickets that potentially have a solution-space > 1, I'd suggest renaming them. For e.g., this one can be renamed to "support running a service that doesn't want to use YARN containers but still co-exists with YARN".

          Take for example the following JIRA summaries, would you change them to describe a problem?

          AM's tracking URL should be a URL instead of a string

          YARN should have a ClusterId/ServiceId

          Yes, I'd change the above two. The other two are apt summaries. The goal should be to indicate the problem one is attacking. And my point here is not that you or someone else is making that mistake and others are not.

          The whole point of Llama is to allow Impala to share resources in a real Yarn cluster doing other workloads like Map-Reduce. In other words, Impala/Llama and other AMs must share cluster resources.

          Well, you should have started with this requirement so that we could all discuss it and come up with a solution instead of putting in approaches that you think are best. This was the same discussion we had in YARN-689, where it took a while for the rest of us to understand the real requirements. Similarly, YARN-789 was put in the FairScheduler without giving consideration to the rest of the system.

          The AM that started the unmanaged container gets the early-preemption/preemption/lost notification from the RM and notifies the out of band process in the corresponding node to release the corresponding resources. (Impala/Llama is doing this today with the dummy sleep containers)

          That won't work for cases where RM wants to forcefully terminate in emergency situations.

          A NM plugin notifies the collocated out of band process that the unmanaged container as ended. This prompts the out of band process to release the corresponding resources. (We are working on getting this in Impala/Llama).

          This again is a new proposal which is never discussed.

          Re this problem, I think you should create a ticket about supporting services that want to use cluster and node level scheduling without using containers. Then if you follow up with a requirement list, we can discuss solutions and an end-to-end design. I can come up with more solutions already, which may or may not work depending on your requirements.

          • Use the dynamic NM resource stuff that just went in, and use signalling between the YARN NM and some outside component to dynamically adjust NM resources
          • Run a long-running service under YARN with containers that dynamically grow and shrink
          Vinod Kumar Vavilapalli added a comment -

          On top of what I said, I want to say this: containers being processes is a fundamental assumption in all of YARN. If you want to change that assumption, you had better be sure of the requirements - it has far-reaching repercussions, both in terms of past/present code and future design - it's not just putting in a flag here and another one there.

          Bikas Saha added a comment -

          It looks like the uber point under discussion here is: what are the scenarios we are targeting, and can we figure out an overall design and a coherent set of changes in YARN that address those scenarios (if we decide that those scenarios make sense for YARN)?
          I think I agree that it would be really helpful if the scenarios were properly laid out and a design proposal made, so that other people in the community can understand what the goal is and why some changes are being suggested and made. Then finalize the design and create a set of sub-tasks or related tasks that can be clearly related to that scenario. We have already been following that approach for RM Restart, RM HA, RM work-preserving restart, long-running services etc. Without this kind of coherent approach, it may appear that a set of disparate changes is being made to the framework on an ad-hoc basis. The sum total of these changes may end up altering the framework in unexpected ways for the community.
          In this light, it would be really helpful if we created an umbrella jira for the goals we want to meet with this and the other Llama related changes, and published an overall design plan that presents a coherent picture of the proposal. It would also help if we made the already committed & in-flight jiras related to this effort sub-tasks of the umbrella jira. Does that sound like a reasonable plan going forward?

          Alejandro Abdelnur added a comment -

          Updated the summary and the description to better describe the use case driving this JIRA.

          I've closed YARN-951 as "won't fix" as it is a workaround for the problem this JIRA is trying to address.

          I don't think there is a need for an umbrella JIRA as this is the only change we need.

          Alejandro Abdelnur added a comment -

          The proposal to address this JIRA is:

          • Allow a NULL ContainerLaunchContext in the startContainer() call; this signals that there is no process to be started with the container.
          • The ContainerLaunch logic would use a latch to block when there is no associated process. The latch will be released on container completion (preemption or termination by the AM); a simplified sketch of this latch idea appears at the end of this comment.

          The changes to achieve this are minimal and they do not alter the lifecycle of a container at all, neither in the RM nor in the NM.

          As previously mentioned by Bikas, this can be seen as a special case of the functionality that YARN-1040 is proposing for managing multiple processes with the same container.

          The scope of work of YARN-1040 is significantly larger and requires API changes, while this JIRA does not require API changes, and the two changes are not incompatible with each other.
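
          For illustration only (this is not the actual ContainerLaunch code from the patch), a simplified sketch of the latch idea: call() blocks until the container is completed, and cleanup releases the latch instead of killing a process tree; the class and method names are invented:

          import java.util.concurrent.Callable;
          import java.util.concurrent.CountDownLatch;

          public class UnmanagedContainerLaunch implements Callable<Integer> {

            private final CountDownLatch completion = new CountDownLatch(1);

            @Override
            public Integer call() throws InterruptedException {
              // No process is forked; simply block until the container completes,
              // so an unmanaged container 'runs' just like a managed one.
              completion.await();
              return 0; // report a clean exit code to the container lifecycle
            }

            // Invoked on container stop/preemption instead of killing a process tree.
            public void cleanupContainer() {
              completion.countDown();
            }
          }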

          Vinod Kumar Vavilapalli added a comment -

          I had a discussion with Alejandro Abdelnur offline last week. Here's the summary

          Requirements

          1. Run YARN side-by-side with external services but use the same scheduling path
          2. 'Containers' thus allocated will no longer map to any processes that are run directly under YARN - thus the name unmanaged containers

          Design points

          • It is not enough to run these external services directly under YARN due to the desire to have 'queries' submitted to these external services obey the same queuing policies, scheduling decisions as any other YARN application

          Details

          • YARN-1040 should enable starting unmanaged containers.
          • Though YARN-1040 enables this, it is not sufficient. We need more restrictions on unmanaged containers, similar to their managed counterparts.
          • ACLs: YARN depends on the ability to enforce resource-usage restrictions. It cannot do so with unmanaged containers, and because of this we need the external framework running unmanaged containers to be trusted.
            • This can be done by making sure that only specific users can start applications that can request unmanaged containers.
          • The same trust is required for RM to sanely depend on evacuation of containers - on preemption, user request etc.
          • Liveliness: YARN implicitly depends on knowing the liveliness of the containers by way of an implicit lease created by process liveliness.
            • This changes with unmanaged containers, so they should periodically 'renew' the lease via a new ContainerManagementProtocol API (a purely hypothetical sketch of such an API appears at the end of this comment).
            • Unmanaged containers that don't renew for a while are deemed dead.
            • YARN knows when a container exits. The only way to simulate this for unmanaged containers is via an explicit stopContainer signal.
          • NodeManagers assume a whole lot of things about containers
            • distributed cache: unmanaged containers don't need any resources to be localized?
            • setting of security tokens: unmanaged containers don't need any file-system tokens?
            • management of logs generated by containers: unmanaged containers don't generate any logs? So NM doesn't need to do anything?
          • Can such trusted application mix and match managed and unmanaged containers?

          Miscellaneous points

          • Unmanaged containers will be disabled by default

          What do others think?
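
          As a footnote to the liveliness bullets above, a purely hypothetical sketch of what such a lease-renewal API could look like; none of these types or methods exist in YARN, and all of the names are invented for illustration:

          import java.util.List;

          import org.apache.hadoop.yarn.api.records.ContainerId;

          public interface UnmanagedContainerLeaseProtocol {

            // The AM (or trusted external framework) would call this periodically
            // for each unmanaged container it still intends to hold on to.
            RenewContainerLeaseResponse renewContainerLease(
                RenewContainerLeaseRequest request);

            // Hypothetical request/response records, loosely following the style
            // of other YARN protocol records.
            interface RenewContainerLeaseRequest {
              List<ContainerId> getContainersToRenew();
            }

            interface RenewContainerLeaseResponse {
              // Containers whose lease had already expired and were deemed dead.
              List<ContainerId> getExpiredContainers();
            }
          }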

          Alejandro Abdelnur added a comment - edited

          Vinod Kumar Vavilapalli, thanks for summarizing our offline chat.

          Regarding ACLs and an on/off switch:

          IMO they are not necessary for the following reason.

          You need an external system installed and running on the node to use the resources of an unmanaged container. If you have direct access to the node to start the external system, you are 'trusted'. If you don't have direct access, you cannot use the resources of an unmanaged container.

          I think this is a very strong requirement already, and it would avoid adding code to manage a new ACL and an on/off switch.

          Regarding Liveliness:

          In the case of managed containers we don't have a liveliness 'report' and the container process could very well be hung. In such a scenario it is the responsibility of the AM to detect the liveliness of the container process and react if it is considered hung.

          In the case of unmanaged containers, the AM would have the same responsibility.

          The only difference is that in the case of managed containers the NM detects when the process exits, while in the case of unmanaged containers this responsibility would fall on the AM.

          Because of this I think we could do without a leaseRenewal/liveliness call.

          Regarding NodeManagers assume a whole lot of things about containers (the 3 bullet items):

          For my current use case none of this is needed. It would be relatively easy to enable such functionality if a use case that needs it arises.

          Regarding Can such trusted application mix and match managed and unmanaged containers?:

          In the way I envision this working, when an AM asks for a container and gets an allocation from the RM, the RM does not know if the AM will start a managed or an unmanaged container. It is only between the AM and the NM that this is known, when the ContainerLaunchContext is NULL.

          Regarding YARN-1040 should enable starting unmanaged containers:

          If YARN-1040 were implemented, yes, it would enable unmanaged containers. However, the scope of YARN-1040 is much bigger than unmanaged containers.

          It should also be possible to implement unmanaged containers as discussed here and later implement YARN-1040.

          Does this make sense?

          Arun C Murthy added a comment -

          I've spent time thinking about this in the context of running a myriad of external systems in YARN such as Impala, HDFS Caching (HDFS-4949) and some others.

          The overarching goal is to allow YARN to act as a ResourceManager for the overall cluster and a Workload Manager for external systems i.e. this way Impala or HDFS can rely on YARN's queues for workload management, SLAs via preemption etc.

          Is that a good characterization of the problem at hand?

          I think it's a good goal to support - this will allow other external systems to leverage YARN's capabilities for both resource sharing and workload management.

          Now, if we all agree on this - we can figure the best way to support this in a first-class manner.


          Ok, the core requirement is for an external system (Impala, HDFS, others) to leverage YARN's workload management capabilities (queues etc.) to acquire resources (cpu, memory) on behalf of a particular entity (user, queue) for completing a user's request (run a query, cache a dataset in RAM).

          The key is that these external systems need to acquire resources on behalf of the user and ensure that the chargeback is applied to the correct user, queue etc.

          This is a brand new requirement for YARN... so far, we have assumed that the entity acquiring the resource would also be actually utilizing the resource by launching a container etc.

          Here, it's clear that the requirement is that the entity acquiring the resource would like to delegate the resource to an external framework. For e.g.:

          1. A user query would like to acquire cpu, memory etc. for appropriate accounting chargeback and then delegate it to Impala.
          2. A user request for caching data would like to acquire memory for appropriate accounting chargeback and then delegate to the Datanode.

          In this scenario, I think explicitly allowing for delegation of a container would solve the problem in a first-class manner.

          We should add a new API to the NodeManager which would allow an application to delegate a container's resources to a different container:

          ContainerManagementProtocol.java

          public interface ContainerManagementProtocol {
            // ...
            public DelegateContainerResponse delegateContainer(DelegateContainerRequest request);
            // ...
          }

          DelegateContainerRequest.java

          public abstract class DelegateContainerRequest {
            // ...
            public ContainerLaunchContext getSourceContainer();

            public ContainerId getTargetContainer();
            // ...
          }

          The implementation of this API would notify the NodeManager to change its monitoring of the recipient container, i.e. Impala or the Datanode, by modifying the cgroup of the recipient container.

          Similarly, the NodeManager could be instructed by the ResourceManager to preempt the resources of the source container to continue serving the global SLAs of the queues - again, this is implemented by modifying the cgroup of the recipient container. This will allow the ResourceManager/NodeManager to be explicitly in control of resources, even in the face of misbehaving AMs etc.


          The result of the above proposal is very similar to what is already being discussed, the only difference being that this is explicit (the NodeManager knows the source and recipient containers), and this allows all existing features such as preemption, over-allocation of resources to YARN queues etc. to continue to work as they do today.


          Thoughts?

          Arun C Murthy added a comment -

          I've opened YARN-1488 to track delegation of container resources.

          Bikas Saha added a comment -

          Is the scenario having containers from multiple users asking for resources within their quota and then delegating them to a shared service to use on their behalf. The above would imply that datanode/impala/others would be running as yarn containers so that they can be targets for delegation.

          Arun C Murthy added a comment -

          Yes, agreed. Sorry, I thought it was clear that this is what I was proposing with:

          The implementation of this API would notify the NodeManager to change its monitoring of the recipient container (i.e. Impala or the Datanode) by modifying the cgroup of the recipient container.
          Similarly, the NodeManager could be instructed by the ResourceManager to preempt the resources of the source container to continue serving the global SLAs of the queues - again, this is implemented by modifying the cgroup of the recipient container. This will allow the ResourceManager/NodeManager to be explicitly in control of resources, even in the face of misbehaving AMs etc.
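          For illustration only, a minimal sketch of how an AM might use the proposed delegation API is below. None of this exists in YARN today: DelegateContainerRequest/Response and delegateContainer() are the types from the proposal above, and the newInstance() factory is assumed here purely in the style of other YARN request records.

          // Hypothetical usage of the proposed API - nothing below is a released YARN interface.
          DelegateContainerRequest request = DelegateContainerRequest.newInstance(
              sourceContainerLaunchContext,   // resources acquired on behalf of the user's query
              impaladContainerId);            // recipient: the long-running impalad container
          DelegateContainerResponse response =
              containerManagementProtocol.delegateContainer(request);
          // On receipt, the NodeManager would enlarge the recipient container's cgroup and
          // charge the delegated resources to the source container's user/queue.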

          Sandy Ryza added a comment -

          Arun, I think I agree with most of the above and your proposal makes a lot of sense to me.

          There are numerous issues to tackle. On the YARN side:

          • YARN has assumed since its inception that a container's resources belong to a single application - we are likely to come across many subtle issues when rethinking this assumption.
          • While YARN has promise as a platform for deploying long-running services, that functionality currently isn't stable in the way that much of the rest of YARN is.
          • Currently preemption means killing a container process - we would need to change the way this mechanism works.

          On the Datanode/Impala side:

          • Rethink the way we deploy these services to allow them to run inside YARN containers.

          Stepping back a little, YARN does three things:

          • Central Scheduling - decides who gets to run and when and where they get to do so
          • Deployment - ships bits across the cluster and runs container processes
          • Enforcement - monitors container processes to make sure they stay within scheduled limits

          The central scheduling part is the most valuable to a framework like Impala because it allows it to truly share resources on a cluster with other processing frameworks. The latter two are helpful - they allow us to standardize the way work is deployed on a Hadoop cluster - but they don't enable anything that is fundamentally impossible without them. While they will simplify things in the long term and create a more cohesive platform, Impala currently has little tangible to gain by doing deployment and enforcement inside YARN.

          So, to summarize, I like the idea and would be happy both to see YARN move in this direction and to help it do so. However, making Impala-YARN integration depend on this fairly involved work would unnecessarily set it back. In the short term, we have proposed a minimally invasive change (making it possible to launch containers without starting processes) that would allow YARN to satisfy our use case. I am confident that the change poses no risk from a security perspective, from a stability perspective, or in terms of detracting from the longer-term vision.

          Vinod Kumar Vavilapalli added a comment -

          Re Tucu's reply

          Regarding ACLs and an on/off switch: IMO they are not necessary for the following reason. You need an external system installed and running in the node to use the resources of an unmanaged container. If you have direct access into the node to start the external system, you are 'trusted'. If you don't have direct access you cannot use the resources of an unmanaged container.

          Unfortunately that is not enough. We are exposing an API on NodeManager that anybody can use. The ACL prevents that.

          In the case of managed containers we don't have a liveliness 'report' and the container process could very well be hung. In such a scenario it is the responsibility of the AM to detect the liveliness of the container process and react if it is considered hung.

          Like I said, we do have an implicit liveliness report - process liveliness. And NodeManager depends on that today to inform the app of container-finishes.

          Regarding the three bullet items about the NM assuming a whole lot of things about containers: for my current use case none of this is needed. It could be relatively easy to enable such functionality if a use case that needs it arises.

          So, then we start off with the assumption that they are not needed? That creates two very different code paths for managed and unmanaged containers. If possible we should avoid that.

          Vinod Kumar Vavilapalli added a comment -

          In this scenario, I think explicitly allowing for delegation of a container would solve the problem in a first-class manner.

          This is an interesting solution that avoids the problems about trust, liveliness reporting and resource limitations' enforcement. +1 for considering something like this.

          Vinod Kumar Vavilapalli added a comment -

          Stepping back a little, YARN does three things:
          Central Scheduling - decides who gets to run and when and where they get to do so
          Deployment - ships bits across the cluster and runs container processes
          Enforcement - monitors container processes to make sure they stay within scheduled limits
          The central scheduling part is the most valuable to a framework like Impala because it allows it to truly share resources on a cluster with other processing frameworks. The latter two are helpful - they allow us to standardize the way work is deployed on a Hadoop cluster - but they don't enable anything that is fundamentally impossible without them. While they will simplify things in the long term and create a more cohesive platform, Impala currently has little tangible to gain by doing deployment and enforcement inside YARN.

          I don't agree with that characterization. The thing is, to enable only central scheduling, YARN has to give up its control over liveliness & enforcement and needs to create a new level of trust. If there are alternative architectures that avoid losing that control, YARN will choose those options. The question is whether external systems want to take that option or not.

          Sandy Ryza added a comment -

          The thing is, to enable only central scheduling, YARN has to give up its control over liveliness & enforcement and needs to create a new level of trust.

          I'm not sure I entirely understand what you mean by create a new level of trust. We are a long way from YARN managing all resources on a Hadoop cluster. YARN implicitly understands that other trusted processes will be running alongside it. The proposed change does not grant any users the ability to use any resources without going through a framework trusted by the cluster administrator.

          Like I said, we do have an implicit liveliness report - process liveliness. And NodeManager depends on that today to inform the app of container-finishes.

          It depends on that or on the AM releasing the resources. Process liveliness is a very imperfect signifier - a process can stick around because of an accidentally unfinished thread even when all its work is done. I have seen clusters where all MR task processes are killed by the AM without exiting naturally and everything works fine.

          I've tried to think through situations where this could be harmful:
          • Malicious application intentionally sits on cluster resources: they can do this already by running a process with sleep(infinity).
          • Application unintentionally sits on cluster resources: this can already happen if a container process forgets to terminate a non-daemon thread.
          In both cases, preemption will prohibit an application from sitting on resources above its fair share.

          Is there a scenario I'm missing here?

          If there are alternative architectures that avoid losing that control, YARN will choose those options.

          YARN is not a power-hungry conscious entity that gets to make decisions for us. We as YARN committers and contributors get to decide what use cases we want to support, and we don't need to choose a single one. We should of course be careful about what we choose to support, but we should be restrictive only when there are concrete consequences of doing otherwise, not simply when a use case violates the abstract idea of YARN controlling everything.

          If the deeper concern is that Impala and similar frameworks will opt not to run fully inside YARN when that functionality is available, I think we would be happy to switch over when YARN supports this in a stable manner. However, I believe this is a long way away and depending on that work is not an option for us.

          Vinod Kumar Vavilapalli added a comment -

          I'm not sure I entirely understand what you mean by create a new level of trust.

          I thought that was already clear to everyone. See my comment here. "YARN depends on the ability to enforce resource-usage restrictions".

          YARN enables both resource scheduling and enforcement of those scheduling decisions. If resources sit outside of YARN, YARN cannot enforce the limits on their usage. For example, YARN cannot enforce the memory usage of a datanode. People may work around it by setting up cgroups on these daemons, but that defeats the purpose of YARN in the first place. That is why I earlier proposed that impala/datanode run under YARN. When I couldn't find a solution otherwise, I revised my proposal to restrict the feature behind a special ACL so that other apps don't abuse the cluster by requesting unmanaged containers and not using those resources.
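          (As an aside, the manual workaround mentioned above amounts to something like the operator-side sketch below. Assumptions: cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, a made-up group name, and the daemon's pid passed as an argument. This is exactly the kind of out-of-band enforcement that bypasses YARN's accounting.)

          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;

          public class DaemonCgroupCap {
              public static void main(String[] args) throws Exception {
                  // args[0] = pid of the datanode process (assumption for this sketch)
                  Path group = Paths.get("/sys/fs/cgroup/cpu/hdfs-datanode"); // hypothetical group
                  Files.createDirectories(group);
                  // Half of the default weight of 1024: a relative CPU cap under contention.
                  Files.write(group.resolve("cpu.shares"), "512".getBytes());
                  // Move the daemon into the group; from here on the kernel, not YARN, enforces it.
                  Files.write(group.resolve("cgroup.procs"), args[0].getBytes());
              }
          }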

          It depends on that or the AM releasing the resources. Process liveliness is a very imperfect signifier ...

          We cannot trust AMs to always release containers. If it were so imperfect, we should change YARN as it is today to not depend on liveliness. I'd leave it as an exercise to see how, once we remove process-liveliness in general, apps will release containers and how clusters get utilized. Bonus points for trying it on a shared multi-tenant cluster with user-written YARN apps.

          My point is that process liveliness, plus accounting based on it, is a well-understood model in Hadoop land. The proposal for leases is meant to continue that.

          Is there a scenario I'm missing here?

          One example that illustrates this: today AMs can go away without releasing containers, and YARN can kill the corresponding containers (as they are managed). If we don't have some kind of lease, and AMs holding unmanaged resources go away without an explicit container release, those resources are leaked.
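          Purely as an illustration of the lease idea, an AM holding unmanaged resources might run something like the sketch below; renewLease() and the surrounding client are entirely hypothetical, the point being only that the AM would have to keep proving liveliness the way a managed container does implicitly through its process.

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          // Hypothetical AM-side heartbeat renewing a lease on unmanaged resources. If the AM
          // dies, renewals stop and the NodeManager could reclaim the resources, mirroring how
          // process-exit reclaims a managed container today.
          ScheduledExecutorService renewer = Executors.newSingleThreadScheduledExecutor();
          renewer.scheduleAtFixedRate(
              () -> unmanagedResourceClient.renewLease(containerId),   // hypothetical API
              0, leaseIntervalSeconds / 3, TimeUnit.SECONDS);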

          YARN is not a power-hungry conscious entity that gets to make decisions for us. Not simply when a use case violates the abstract idea of YARN controlling everything. [...]

          Of course, when I say YARN, I mean the YARN community. You're taking it too literally.

          I was pointing out your statements about "Impala currently has little tangible to gain by doing deployment and enforcement inside YARN" and "However, making Impala-YARN integration depend on this fairly involved work would unnecessarily set it back". The YARN community doesn't make decisions based on those things.

          Overall, I didn't originally have a complete solution for making it happen - so I came up with ACLs and leases. But delegation as proposed by Arun seems like a solution to all of these problems. Other than saying you don't want to wait for Impala-under-YARN integration, I haven't heard any technical reservations against this approach.

          Sandy Ryza added a comment -

          Other than saying you don't want to wait for impala-under-YARN integration, I haven't heard any technical reservations against this approach.

          I have no technical reservations with the overall approach. In fact I'm in favor of it. My points are:

          • We will not see this happen for a while, and the original approach on this JIRA supports a workaround that has no consequences for clusters not running Impala on YARN.
          • I'm sure many who would love to take advantage of centrally resource-managed HDFS caching will be unwilling to deploy HDFS through YARN. The same goes for all sorts of legacy applications. If, besides the changes Arun proposed, we can expose YARN's central scheduling independently of its deployment/enforcement, there would be a lot to gain. If this is within easy reach, I don't find arguments that YARN is philosophically opposed to it, or that the additional freedom would allow cluster configurers to shoot themselves in the foot, satisfying.

          I realize that we are rehashing many of the same arguments so I'm not sure how to make progress on this. I'll wait until Tucu returns from vacation to push further.

          Arun C Murthy added a comment -

          I have no technical reservations with the overall approach.

          Since we agree on the approach and the direction we want to go in, perhaps we can now discuss how to get there?

          We don't have to implement everything in the first go; we just need to implement enough to meet your goal of quick integration while staying on the long-term path we want to get to.

          Does that make sense?

          Vinod Kumar Vavilapalli added a comment -

          I just caught up with YARN-1197. It seems like part of that solution is very relevant to this JIRA. For example,

          Some daemon-based applications may want to start exactly one daemon in each allocated node (like OpenMPI); such a daemon will launch/monitor workers (like MPI processes) itself. We can first allocate some containers for daemons, and adjust their size according to the application's requirements. This will make YARN support two-staged scheduling. (Described in YARN-1197.)

          Alejandro Abdelnur added a comment -

          [Doing self-clean-up of JIRAs.] I found a different way of doing what I needed to do.


            People

            • Assignee: Alejandro Abdelnur
            • Reporter: Alejandro Abdelnur
            • Votes: 0
            • Watchers: 29
