Hadoop YARN / YARN-2877

Extend YARN to support distributed scheduling

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      With this JIRA we are introducing distributed scheduling in YARN.
      In particular, we make the following contributions:
      - Introduce the notion of container types. GUARANTEED containers follow the semantics of the existing YARN containers. OPPORTUNISTIC ones can be seen as lower priority containers, and can be preempted in order to make space for GUARANTEED containers to run.
      - Queuing of tasks at the NMs. This enables us to send more containers to an NM than its currently available resources can accommodate. At the moment we allow queuing of OPPORTUNISTIC containers only. Once resources become available at the NM, such queued containers can immediately start their execution.
      - Introduce the AMRMProxy. This is a service running at each node, intercepting the requests between the AM and the RM. It is instrumental for both distributed scheduling and YARN Federation (YARN-2915).
      - Enable distributed scheduling. To minimize their allocation latency, OPPORTUNISTIC containers are dispatched immediately to NMs in a distributed fashion by using the AMRMProxy of the node where the corresponding AM resides, without needing to go through the ResourceManager.

      All the functionality introduced in this JIRA is disabled by default, so it will not affect the behavior of existing applications.
      We have introduced parameters in YarnConfiguration to enable NM queuing (yarn.nodemanager.container-queuing-enabled), distributed scheduling (yarn.distributed-scheduling.enabled) and the AMRMProxy service (yarn.nodemanager.amrmproxy.enable).
      AMs currently need to specify the type of container to be requested for each task. We are in the process of adding to the MapReduce AM the ability to randomly request OPPORTUNISTIC containers for a specified percentage of a job's tasks, so that users can experiment with the new features.
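      These flags are normally set in yarn-site.xml; the following is only a minimal programmatic sketch of turning them on via YarnConfiguration, using the property names exactly as listed in this note (they may differ in later releases). The class name is hypothetical.

        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        public class EnableDistributedScheduling {
          public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();
            // All three features are disabled by default.
            conf.setBoolean("yarn.nodemanager.amrmproxy.enable", true);          // AMRMProxy service on each NM
            conf.setBoolean("yarn.distributed-scheduling.enabled", true);        // distributed scheduling path
            conf.setBoolean("yarn.nodemanager.container-queuing-enabled", true); // OPPORTUNISTIC queuing at NMs
            System.out.println("AMRMProxy enabled: "
                + conf.getBoolean("yarn.nodemanager.amrmproxy.enable", false));
          }
        }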

      Description

      This is an umbrella JIRA that proposes to extend YARN to support distributed scheduling. Briefly, some of the motivations for distributed scheduling are the following:
      1. Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines.
      2. Reduce allocation latency for tasks where scheduling time dominates, i.e., the task execution time is much shorter than the time required to obtain a container from the RM.

        Issue Links

          Activity

          asuresh Arun Suresh added a comment -

          Ah.. makes sense.. thanks for the clarification Junping Du

          ajisakaa Akira Ajisaka added a comment -

          Hi Arun Suresh, would you update the fix version field of the cherry-picked jiras?
          In addition, CHANGES.txt was added when cherry-picking YARN-2882. I filed YARN-5126 to remove it.

          djp Junping Du added a comment -

          Thanks for the investigation, Wangda Tan and Junping Du

          Most of the investigation work was done by Wangda. We should give all the credit to him.

          Not sure if I understand correctly: are you proposing that we should NOT declare new fields in sequence? E.g., if the last field index is 10 for a struct in trunk and we want to add a new field, should we set it to something like 15 and not 11?

          I think what Wangda proposes above is: the next time we hit the same situation (patch 1 goes to trunk first but not branch-2, patch 2 needs to go to branch-2, and both change fields of the same proto; assume patch 1's field id = 2 and patch 2's field id = 3 on trunk), we don't need to adjust the sequence in trunk any more like we did earlier. Instead, on branch-2, we can keep patch 2's field id = 3 and skip id = 2, which stays reserved for patch 1 when it is committed to branch-2 in the future.
          That saves us from possible incompatible commits due to branch differences.

          asuresh Arun Suresh added a comment -

          Jian He, I just cherry-picked all sub-task patches from trunk to branch-2. Do let me know if you hit any issues.

          asuresh Arun Suresh added a comment -

          Thanks for the investigation, Wangda Tan and Junping Du

          So next time we should not update the sequence of fields in trunk/branch-2; what we need to do is make sure the fields of protos across branches have the same ids.

          Not sure if I understand correctly: are you proposing that we should NOT declare new fields in sequence? E.g., if the last field index is 10 for a struct in trunk and we want to add a new field, should we set it to something like 15 and not 11?

          leftnoteasy Wangda Tan added a comment -

          An additional note:

          Junping and I investigated PB compatibility when adding new fields to the middle of a proto.

          Let's say:
            message PB1Proto {
              optional int32 w = 1;
              optional int32 x = 2;
              optional int32 z = 4;
            }

            message PB2Proto {
              optional int32 w = 1;
              optional int32 x = 2;
              optional int32 y = 3;
              optional int32 z = 4;
            }

          PB2Proto can read PB1Proto messages without any issue.

          So next time we should not update the sequence of fields in trunk/branch-2; what we need to do is make sure the fields of protos across branches have the same ids.

          kkaranasos Konstantinos Karanasos added a comment -

          Marked the JIRA as resolved and added a release note.
          Thank you all for the invaluable feedback, the contributions and the extensive code reviews!
          Among all the people that contributed, I would like to particularly call out Arun Suresh, Carlo Curino, Chris Douglas, Subru Krishnan, Kishore Chaliparambil, Karthik Kambatla, Wangda Tan, Jian He, and Sriram Rao.

          asuresh Arun Suresh added a comment -

          So, to unblock YARN-4832 for 2.8, we'll add the new field before the ContainerQueuingLimitProto instead of after,

          Sure.. sounds good

          jianhe Jian He added a comment -

          Arun Suresh, Konstantinos Karanasos, sounds good to me.
          So, to unblock YARN-4832 for 2.8, we'll add the new field before the ContainerQueuingLimitProto instead of after, and you can do the backport later on.

          asuresh Arun Suresh added a comment -

          Jian He, actually we would prefer it to be in branch-2 too.
          I'd say if we can commit YARN-5090 too, it would be nice (Wangda Tan has already given a +1).

          Aside from that, I had planned at least 2 more JIRAs before we can release:

          1. Documentation patch
          2. Minor flag in MapReduce to allow end users to test Distributed Scheduling (say, allow a percentage of Map tasks to be requested as OPPORTUNISTIC, with the default being 0)

          We can definitely add the above 2 after pushing to branch-2
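          The MapReduce knob mentioned in item 2 did not exist yet at this point; below is only a toy, self-contained illustration of the idea (randomly mark a configured percentage of map tasks as OPPORTUNISTIC, defaulting to 0). All names are hypothetical and not part of the MapReduce AM.

            import java.util.Arrays;
            import java.util.List;
            import java.util.Random;

            public class OpportunisticSampler {
              /** Returns true for roughly 'percent' out of every 100 calls. */
              static boolean requestOpportunistic(Random rng, int percent) {
                return rng.nextInt(100) < percent;
              }

              public static void main(String[] args) {
                Random rng = new Random(42);
                int percent = 30; // hypothetical knob; a default of 0 keeps every task GUARANTEED
                List<String> mapTasks = Arrays.asList("m_0", "m_1", "m_2", "m_3", "m_4");
                for (String task : mapTasks) {
                  String type = requestOpportunistic(rng, percent) ? "OPPORTUNISTIC" : "GUARANTEED";
                  System.out.println(task + " -> " + type);
                }
              }
            }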

          kkaranasos Konstantinos Karanasos added a comment -

          Hi Jian He and thanks for bringing this up.
          We are actually planning to push our changes to branch-2 as well.
          As you say, we will focus on pushing first the patches that could cause incompatibilities, such as the ones you mentioned.

          jianhe Jian He added a comment -

          Arun Suresh, Konstantinos Karanasos, how likely is it that this goes to branch-2 too? At the least, I see that YARN-5073 (refactoring) can be committed to branch-2.
          I am asking because this will cause code divergence between trunk and branch-2, and other patches may require writing two versions.

          One other thing is the newly added protocol buffer field. I remember that protocol buffers require field tag numbers to be the same for compatibility. A new field, ContainerQueuingLimitProto, was added to NodeHeartBeatResponse and occupies field number 14, but it is only committed to trunk. If we need to add a new field to NodeHeartBeatResponse in branch-2 (YARN-4832) that also uses field number 14, this will make branch-2 and trunk incompatible in the NM heartbeat.

          kkaranasos Konstantinos Karanasos added a comment -

          Hi Tan, Wangda,

          Thanks for pointing out HADOOP-11552. It seems it could also be used for the same purpose.
          I would suggest starting with the technique of frequent AM-LocalRM heartbeats and less frequent LocalRM-RM heartbeats. Once HADOOP-11552 gets resolved, we can consider using it.

          I think the top-k node list technique cannot completely solve the over-subscription issue. In a production cluster, applications come in waves; it is possible that a few large applications exhaust all the resources in a cluster within a few seconds. Maybe another possible approach to mitigate the issue is propagating queueable containers from the NM to the RM periodically, so the NM can still make decisions but the RM is also aware of these queueable containers.

          As long as k is sufficiently big, the phenomenon you describe should not be very pronounced.
          Moreover, corrective mechanisms (YARN-2888) will lead to moving tasks from highly-loaded nodes to less busy ones.
          Going further, what you are suggesting would also make sense.

          leftnoteasy Wangda Tan added a comment -

          Hi Konstantinos Karanasos,
          Thanks for reply:

          We are planning to address this by having smaller heartbeat intervals in the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval to 200ms (in other words, we will propagate to the RM only one in every four heartbeats).

          Maybe you could also take a look at HADOOP-11552, which could possibly achieve better latency and reduce heartbeat frequency.

          This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above. As the LocalRM will be placing the QUEUEABLE containers to the least loaded nodes, preemption will be minimized.

          I think the top-k node list technique cannot completely solve the over-subscription issue. In a production cluster, applications come in waves; it is possible that a few large applications exhaust all the resources in a cluster within a few seconds. Maybe another possible approach to mitigate the issue is propagating queueable containers from the NM to the RM periodically, so the NM can still make decisions but the RM is also aware of these queueable containers.

          That said, as you also mention, QUEUEABLE containers are more suitable for short-running tasks, where the probability of a container being preempted is smaller.

          Ideally it's better to support all non-long-running-service tasks. The LocalRM could allocate short-running queueable tasks and the RM can allocate the other queueable tasks.

          kkaranasos Konstantinos Karanasos added a comment -

          Thank you for the detailed comments, Wangda Tan.

          Regarding #1:

          • Indeed, the AM-LocalRM communication should be much more frequent than the LocalRM-RM (and subsequently AM-RM) communication, in order to achieve millisecond-latency allocations.
            We are planning to address this by having smaller heartbeat intervals in the AM-LocalRM communication when compared to the LocalRM-RM. For instance, the AM-LocalRM heartbeat interval can be set to 50ms, while the LocalRM-RM interval to 200ms (in other words, we will propagate to the RM only one in every four heartbeats).
            We will soon create a sub-JIRA for this.
          • Each NM will periodically estimate its expected queue wait time (YARN-2886). This can simply be based on the number of tasks currently in its queue, or (even better) on the estimated execution times of those tasks (in case they are available). Then, this expected queue wait time is pushed through the NM-RM heartbeats to the ClusterMonitor (YARN-4412) that runs as a service in the RM. The ClusterMonitor gathers this information from all nodes, periodically computes the least loaded nodes (i.e., those with the smallest queue wait times), and adds that list to the heartbeat response, so that all nodes (and in turn LocalRMs) get the list. This list is then used during scheduling in the LocalRM (a minimal sketch of this selection step appears right after this list).
            Note that simpler solutions (such as the power of two choices used in Sparrow) could be employed, but our experiments have shown that the above "top-k node list" leads to considerably better placement (and thus load balancing), especially when task durations are heterogeneous.
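          The following is only a toy, self-contained sketch of the "top-k node list" selection step described above (pick the k nodes with the smallest estimated queue wait time); it is not the actual ClusterMonitor code, and all names are hypothetical.

            import java.util.LinkedHashMap;
            import java.util.List;
            import java.util.Map;
            import java.util.stream.Collectors;

            public class TopKNodeList {
              /** Returns the ids of the k nodes with the smallest estimated queue wait time. */
              static List<String> leastLoadedNodes(Map<String, Long> waitTimeMsByNode, int k) {
                return waitTimeMsByNode.entrySet().stream()
                    .sorted(Map.Entry.comparingByValue())   // smallest wait time first
                    .limit(k)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
              }

              public static void main(String[] args) {
                Map<String, Long> waits = new LinkedHashMap<>();
                waits.put("nm-1", 120L);
                waits.put("nm-2", 15L);
                waits.put("nm-3", 60L);
                waits.put("nm-4", 5L);
                System.out.println(leastLoadedNodes(waits, 2)); // [nm-4, nm-2]
              }
            }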

          Regarding #2:
          This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above. As the LocalRM will be placing the QUEUEABLE containers to the least loaded nodes, preemption will be minimized.
          More techniques can be used to further mitigate the problem. For instance, we can "promote" a QUEUEABLE container to a GUARANTEED one in case it has been preempted more than k times.
          Moreover, we can dynamically set limits to the number of QUEUEABLE containers accepted by a node in case of excessive load due to GUARANTEED containers.
          That said, as you also mention, QUEUEABLE containers are more suitable for short-running tasks, where the probability of a container being preempted is smaller.
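          A toy sketch of the "promote after k preemptions" heuristic mentioned above (hypothetical names, not an actual YARN class): once a QUEUEABLE container has been preempted more than a configured number of times, it is resubmitted as GUARANTEED.

            import java.util.HashMap;
            import java.util.Map;

            public class PromotionPolicy {
              private final int maxPreemptions; // the "k" in the comment above
              private final Map<String, Integer> preemptionCounts = new HashMap<>();

              PromotionPolicy(int maxPreemptions) {
                this.maxPreemptions = maxPreemptions;
              }

              /** Called when a QUEUEABLE container is preempted; returns true if it should be
               *  resubmitted as GUARANTEED instead of being queued again. */
              boolean promoteOnPreemption(String containerId) {
                int count = preemptionCounts.merge(containerId, 1, Integer::sum);
                return count > maxPreemptions;
              }

              public static void main(String[] args) {
                PromotionPolicy policy = new PromotionPolicy(2);
                System.out.println(policy.promoteOnPreemption("c1")); // false (1st preemption)
                System.out.println(policy.promoteOnPreemption("c1")); // false (2nd)
                System.out.println(policy.promoteOnPreemption("c1")); // true  (3rd: promote)
              }
            }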

          leftnoteasy Wangda Tan added a comment -

          Thanks Konstantinos Karanasos, Arun Suresh,

          I just caught up with the latest design doc; my 2 cents:
          There are two major purposes of distributed scheduling: 1) better allocation latency, and 2) leveraging idle resources.

          #1 will be achieved when

          • AM -> LocalRM communication can be done within a single RPC call (rather than heartbeating like the normal AM-RM allocation); otherwise it will be hard to achieve millisecond-level latency.
          • The LocalRM has enough information to allocate resources on an NM that can be used directly without waiting. I think stochastic placement plus caching some information from the other LocalRMs could solve the problem.

          #2 can be achieved, but the distributed solution doesn't have a global picture of resources, and guaranteed containers can always preempt queueable containers; this could lead to excessive preemption of queueable containers.
          If the RM can decide where to allocate queueable containers, it could avoid a lot of such preemptions (instead of allocating on a node that already has lots of queueable containers, allocate on a node with "real" idle resources).
          To me, this becomes a bigger issue if an application wants to use opportunistic resources to run normal containers (such as a 10-minute MR task). How to guarantee that the RM doesn't keep allocating resources to the LocalRM for a long time is a problem. IMO distributed scheduling is more suitable for short-lived (a few seconds) and low-latency tasks.

          kkaranasos Konstantinos Karanasos added a comment -

          Adding the first version of the design document.

          kkaranasos Konstantinos Karanasos added a comment -

          Thanks for the input, Anubhav Dhoot. This is an interesting discussion.

          There are indeed cases in which distributed scheduling can hurt job latency. This is more pronounced in the following cases:

          1. Queueable containers are used both for short- and long-running tasks.
          2. Jobs have many tasks (the chances that one of these tasks gets stuck in a queue are higher).
          3. Cluster load is high.

          Based on the above, a first observation is that queueable containers should mostly be used for short-running tasks, if job latency is of importance.
          Moreover, when jobs have a large number of tasks, the AM policy should probably ask for queueable containers only for a subset of them (even if they are all short-running).

          Still though, as you also mention, corrective mechanisms should be used to further improve latency.

          • One such mechanism is queuing in multiple locations, as is done by Sparrow and Apollo. In that case the LocalRM would pick two nodes instead of one to queue the request. This is something we have not tried yet, but it may be useful to do so (a toy sketch of this "power of two choices" idea follows this list).
          • Another mechanism we are proposing is queue rebalancing, that is, whenever some queues have a much bigger load than others, we dequeue some of their requests and send them to a less loaded queue. Of course, we need to be careful about when to dequeue containers, because we may end up increasing the latency if we accidentally dequeue the same request many times.
          • A last mechanism that seems interesting is reordering of requests within a queue, based on some policy (e.g., based on the submission time of the application the task belongs to).
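          The following is only a toy, self-contained sketch of the "power of two choices" placement mentioned in the first bullet (sample two candidate nodes at random and queue on the one with the shorter queue); all names are hypothetical.

            import java.util.Arrays;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;
            import java.util.Random;

            public class PowerOfTwoChoices {
              /** Pick two candidate nodes at random and return the one with the shorter queue. */
              static String pickNode(List<String> nodes, Map<String, Integer> queueLength, Random rng) {
                String a = nodes.get(rng.nextInt(nodes.size()));
                String b = nodes.get(rng.nextInt(nodes.size()));
                return queueLength.getOrDefault(a, 0) <= queueLength.getOrDefault(b, 0) ? a : b;
              }

              public static void main(String[] args) {
                List<String> nodes = Arrays.asList("nm-1", "nm-2", "nm-3");
                Map<String, Integer> queueLength = new HashMap<>();
                queueLength.put("nm-1", 7);
                queueLength.put("nm-2", 1);
                queueLength.put("nm-3", 4);
                System.out.println(pickNode(nodes, queueLength, new Random(7)));
              }
            }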

          More thoughts are definitely welcome.

          adhoot Anubhav Dhoot added a comment -

          +1 for the notion of distributed scheduling. I think it will go a long way toward addressing both the latency and the scale goals of YARN.

          In my experience with similar distributed scheduling systems, we can run into the following types of issues:
          a) The node is currently full of running containers, and the estimate of when capacity will free up for running queued requests could be hard to make or wrong. Your request might be queued for a long time, affecting the startup latency of the queueable container.
          b) Multiple LocalRMs could race to grab available space on an NM, and one might get queued behind other requests, with effects similar to a).

          For the sake of discussing mechanisms, I would suggest discussing the pros and cons of 1) the ability to schedule queueable containers on multiple nodes, and 2) the ability to cancel queued requests.
          Giving the power of at least 2 NM choices could address a lot of the variability in queueable container startup latency.
          One way is to keep the queue of requests in the NM but, if needed, have the NMs ultimately confirm with the requesting LocalRM that the queued request is still valid.

          curino Carlo Curino added a comment -

          I am going to echo Konstantinos Karanasos regarding "malicious" AMs.

          The key architectural change we propose is to introduce a proxy layer (YARN-2884). This gives us a "place" that is both distributed and part of the infrastructure (thus inherently trusted) where we can enact policies.
          This is where we host the LocalRM functionality of YARN-2885. With this in place we do not have to depend on trusting the AM for distributed decisions (the AM only exposes its need for containers of different types).
          On the contrary, we can enable a broad spectrum of infrastructure-level policies that can leverage explicit or implicit information to impose caps, or to balance (or skew) where the queueable containers should be allocated, etc.

          As we have done in the past, we are working towards providing rather general-purpose mechanisms, and we propose a first set of policies (AM, LocalRM, NM start/stop of containers). Policies can be evolved/overridden easily depending on use cases, while mechanisms are a little harder to change. To this end, carefully discussing "other" use cases, such as the conversation around using queueable containers for Impala, is very important, as we might have missed "hooks" in the mechanisms that are necessary to support those scenarios.

          kkaranasos Konstantinos Karanasos added a comment -

          I used the wrong name in the above comment – it was referring to Devaraj K's comment.

          kkaranasos Konstantinos Karanasos added a comment -

          Devaraj Das, to answer your questions:

          1. Guaranteed-start containers always have priority over queueable ones. Thus, in the case you describe, if the NM cannot accommodate both requests, the guaranteed-start one will start first.
          2. If the queueable one was started before the guaranteed-start request arrived, it will be preempted/killed so that the guaranteed-start one can begin execution.
          3. Queueable requests are submitted by the AM to the LocalRM running on the same node as the AM, but those requests can be queued at any NM of the cluster (at each moment we pick the most idle ones to queue those requests).
          kkaranasos Konstantinos Karanasos added a comment -

          Tan, Wangda, regarding your question about how the AM will know which NMs are more idle than others, this is related to YARN-2886. Each NM estimates its queue waiting time (based on the tasks already running and those waiting in its queue) and sends this waiting time to the RM through the heartbeat. Note that this is just an integer, so it is very lightweight. The RM can then push this information to the rest of the NMs (again through the heartbeats). This way each node knows the queue status of the other NMs and can decide where to queue its queueable requests. However, since this information may not always be precise (due to bad estimation or stale info), we also introduce correction mechanisms for rebalancing the queues, if need be (YARN-2888).
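          A toy, self-contained sketch of the per-NM wait-time estimate described above, in its simplest form (sum the duration estimates of the tasks already queued and report a single number); it is not the YARN-2886 implementation, and all names are hypothetical.

            import java.util.ArrayDeque;
            import java.util.Queue;

            public class QueueWaitEstimator {
              /** Estimated durations (ms) of the tasks currently waiting in this NM's queue. */
              private final Queue<Long> queuedTaskMs = new ArrayDeque<>();

              void enqueue(long estimatedMs) {
                queuedTaskMs.add(estimatedMs);
              }

              /** Estimated wait for a newly queued task: a single integer that is cheap to
               *  piggyback on the NM -> RM heartbeat. */
              long estimatedWaitMs() {
                long total = 0;
                for (long ms : queuedTaskMs) {
                  total += ms;
                }
                return total;
              }

              public static void main(String[] args) {
                QueueWaitEstimator estimator = new QueueWaitEstimator();
                estimator.enqueue(2_000);
                estimator.enqueue(500);
                System.out.println(estimator.estimatedWaitMs()); // 2500
              }
            }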

          Regarding your other questions:

          1. These "malicious" AMs is one of the basic reasons we have introduced the Local RM. The AMs can make queueable requests only to the Local RM, who can throttle down "aggressive" AMs without even needing to reach the central RM. Clearly, as you mention, the central RM can also be involved for imposing elaborate fairness/capacity constraints, if those are needed.
          2. Promoting a queueable container to a guaranteed-start one is indeed interesting, and we have been investigating the cases for which it would bring benefits. One is the case you mention. Another is in case a queueable container has been pre-empted/killed many times due to other guaranteed-start requests.
          devaraj.k Devaraj K added a comment -

          +1 for the idea, Sriram Rao, Carlo Curino. I just wanted to ask the following, in case I am not missing something from the above.

          1. If an OPTIMISTIC container is assigned to the AM, and at the same time the RM assigns a CONSERVATIVE container for the same resources, which one will the NM consider and start?

          2. If an OPTIMISTIC container is assigned to the AM and has started, and the NM receives a start request for a CONSERVATIVE container while resources are not available, will the NM preempt the running OPTIMISTIC containers, or will it make the CONSERVATIVE request wait for the OPTIMISTIC containers to complete?

          3. Is there any provision for the AM to request OPTIMISTIC containers on a remote NM as well?

          leftnoteasy Wangda Tan added a comment -

          Thanks very much for the explanations, Konstantinos Karanasos and Sriram Rao; I will reply to both together.

          Now I can better understand the use case. Yes, the queueable containers do not necessarily need to be sent to the central RM, unless we want to add other features like queue balancing, etc.

          One more question: how can the AM know which NMs are more idle than others? Simply querying the status of every NM is not efficient enough.

          And I'm thinking distributed scheduling could be integrated with an existing scheduler, like the Capacity Scheduler. Some other features could be added with this:

          1. For now, we trust the AM to make correct opportunistic container launch requests. But consider a case where a large cluster has only a few applications using opportunistic launch and the others are conservative apps. It is possible for an AM to "steal" a lot of resources from the cluster's NMs by sending opportunistic launch requests to all NMs. We could have the centralized RM combine the resource usage of queueable/conservative containers to enforce fairness, and put malicious AMs on blacklists.
          2. As the title says (distributed scheduling), we may be able to do more than launch containers "opportunistically". For example, we can launch an opportunistic container on an NM, but it could become a conservative container after a heartbeat to the RM if the resource fits the capacity settings of each queue.

          Thanks in advance!

          sriramsrao Sriram Rao added a comment -

          Wangda Tan By definition, the allocation decisions made by the central RM win out. That is, whenever there is a conflict, guaranteed-start (or CONSERVATIVE) containers will be executed prior to queueable (or OPTIMISTIC) containers. This also means that the NM may be forced to preempt running queueable containers to make room. Lastly, to allow some level of predictability in the execution time of queueable containers, we could use leases: a queueable container is allowed to execute for at most N seconds even when there is a conflict, and if the container hasn't exited, the NM will preempt it after that time interval elapses (i.e., the lease expires). This mechanism can help minimize preemption for queueable containers.
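          A toy, self-contained sketch of the lease idea above (a queueable container runs for at most N seconds under conflict before the NM may preempt it); the class and method names are hypothetical.

            public class QueueableLease {
              private final long startMs;
              private final long leaseMs; // the "N seconds" from the comment above

              QueueableLease(long startMs, long leaseMs) {
                this.startMs = startMs;
                this.leaseMs = leaseMs;
              }

              /** Once the lease has expired, the NM may preempt the container on a conflict. */
              boolean expired(long nowMs) {
                return nowMs - startMs >= leaseMs;
              }

              public static void main(String[] args) {
                QueueableLease lease = new QueueableLease(0, 10_000); // 10-second lease
                System.out.println(lease.expired(5_000));  // false: keep running despite conflict
                System.out.println(lease.expired(12_000)); // true: eligible for preemption
              }
            }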

          Re: your other questions:

          1. Capacity is enforced for guaranteed-start containers. For queueable containers, policies could be pushed down from the central RM (YARN-2885).
          2. It is not necessary that the queueable containers factor into the central RM's allocation choices. That said, having that information at the central RM can help minimize preemption.
          3. For enabling load balancing of queues at the NMs (YARN-2888) and allowing AMs to choose where to submit queueable containers (YARN-2887), exposing queue information to the LocalRMs is desirable.
          kkaranasos Konstantinos Karanasos added a comment -

          Sujeet Varakhedi, also the Apollo paper (OSDI 2014) has interesting ideas about distributed scheduling.

          Tan, Wangda, glad you like the idea and thanks for the interesting points. To answer your questions:
          1. Apart from the limit that the LocalRM can impose on the number of queueable containers that each AM can receive (for which the central RM does not need to be involved), information about the status of the other queues of the system will also be passed in the heartbeat response from the RM to the NM. This way we will be able to impose global policies (such as capacity) in a distributed fashion. BTW, this information is also used by the LocalRMs to decide in which NMs to queue requests.
          2. If no policies need to be imposed, the central RM does not need to know anything about the queueable containers that each AM uses. Limits on the number of queueable containers per AM can be imposed directly by the LocalRM. However, in case fine-grained policies need to be imposed (as mentioned in point (1) above, such as the number of queueable containers per queue in the capacity scheduler), the central RM can receive information about the number of queueable containers used by each AM, so that it imposes limits per queue. Clearly, the more information you pass to the central RM, the more powerful the policies you can impose, but also the bigger the load you push to the central RM. So there is a sweet spot there, based on the needs of each cluster.
          3. This is a good point as well. Such information can be piggybacked in the heartbeats to the central RM (again, with the tradeoffs discussed above).

          leftnoteasy Wangda Tan added a comment -

          Thanks Sriram Rao for bringing up the great idea, and Konstantinos Karanasos/Carlo Curino for the explanations. We definitely need such mechanisms for low-latency container launching to support millisecond-latency tasks.

          Some questions about this:

          1. Since the LocalRMs will be totally distributed, is it still possible to enforce capacity between queues?
          2. Will such opportunistic containers be visible to the central RM (which is used to schedule CONSERVATIVE containers)?
            1. If yes, can the central RM decide whether an opportunistic container is valid or not (say, if the number of containers exceeds the app's limit)? And will preemption still work for opportunistic containers?
            2. If not, should something coordinate such containers?
          3. Will the central scheduler state (maybe not all of it, but important info like each queue's used resources, etc.) be broadcast to the distributed LocalRMs? I think it might be useful for the LocalRMs to decide which opportunistic container should go first.

          Thanks in advance!

          Wangda

          sujeetv Sujeet Varakhedi added a comment -

          +1 for distributed scheduling; SQL engines for Hadoop can greatly benefit from it. We also need to look at a design where we can give AMs more control over scheduling policies: the RM just acts as a source of overall cluster state, NMs have local queues, and then, based on NM queue wait times, AMs can decide where to request tasks, similar to how Sparrow works. This kind of scheduling becomes important for services that need dedicated non-shared clusters, like HBase and HAWQ.

          sriramsrao Sriram Rao added a comment -

          Chen He, the number of AMs running on any machine is configurable and small (on the order of a few tens), so the overhead on the LocalRM should be negligible.

          airbots Chen He added a comment -

          This is an interesting idea. Distributed scheduling and global scheduling have their own pros and cons. In short, global scheduling can achieve optimal matching between tasks and resources but may run into scalability problems as the system grows larger and larger. Distributed scheduling is scalable but may end up with sub-optimal placements if there is no communication between the distributed schedulers.

          The LocalRM can reduce the RM's burden by handling the communication with local AMs, which is a good idea. IMHO, worker nodes are becoming increasingly powerful and large (more memory and cores). Is it possible that the LocalRM affects the NM's performance if there are many AMs running on a single server?

          kkaranasos Konstantinos Karanasos added a comment -

          Adding some more details, now that we have added the first sub-tasks.

          In YARN-2882 we introduce two types of containers: guaranteed-start and queueable. The former are the ones existing in YARN today (they are allocated by the central RM and, once allocated, are guaranteed to start). The latter make it possible to queue container requests in the NMs and will be used for distributed scheduling.
          The queuing of (queueable) container requests in the NMs is proposed in YARN-2883.

          Each NM will now also have a LocalRM (Local ResourceManager) that will receive all container requests from the AMs running on the same machine:

          • For the guaranteed-start container requests, the LocalRM acts as a proxy (YARN-2884), forwarding them to the central RM.
          • For the queueable container requests, the LocalRM is responsible for sending them directly to the NM queues (bypassing the central RM). Deciding the NMs where these requests are queued is based on the estimated waiting time in the NM queues, as discussed in YARN-2886.

          Based on some policy (YARN-2887), each AM will determine what type of containers to ask for: only guaranteed-start, only queueable, or a mix thereof.
          For instance, an AM may request guaranteed-start containers for its tasks that are expected to be long-running, whereas it may ask for queueable containers for its short tasks (for which the back-and-forth with the central RM may take longer than the task execution itself). This way we reduce the scheduling latency, while increasing the utilization of the cluster (if we had to go to the central RM for all these short tasks, some resources of the cluster might remain idle in the meantime).
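          As a rough illustration of what such a policy could look like, the sketch below picks a container type from an expected task duration. The type and class names are placeholders for this discussion; the actual names and the AM-side API are defined in the sub-tasks (YARN-2882, YARN-2887), not here.

            // Placeholder type names mirroring the proposal's terminology; this is not
            // the enum that ships with YARN.
            enum ProposedContainerType { GUARANTEED_START, QUEUEABLE }

            // Hypothetical AM-side policy: short tasks become queueable requests so they
            // skip the round trip to the central RM; longer tasks keep the existing
            // guaranteed-start semantics.
            final class DurationBasedContainerTypePolicy {
              private final long shortTaskThresholdMs;

              DurationBasedContainerTypePolicy(long shortTaskThresholdMs) {
                this.shortTaskThresholdMs = shortTaskThresholdMs;
              }

              ProposedContainerType typeFor(long expectedTaskDurationMs) {
                return expectedTaskDurationMs <= shortTaskThresholdMs
                    ? ProposedContainerType.QUEUEABLE
                    : ProposedContainerType.GUARANTEED_START;
              }
            }

          A real policy could also take NM queue lengths or a per-AM queueable quota (YARN-2889) into account; the duration threshold is just the simplest example.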

          To ensure the NM queues remain balanced, we propose corrective mechanisms for NM queue rebalancing in YARN-2888.
          Moreover, to ensure no AM is abusing the system by asking for too many queueable containers, we can impose a limit on the number of queueable containers that each AM can receive (YARN-2889).

          sriramsrao Sriram Rao added a comment -

          Karthik Kambatla (1) Yes, the central RM can allocate optimistic containers; however, as you note, that introduces extra latency. (2) Scaling the RM's allocation, particularly when you have small tasks, is another motivation as well.

          curino Carlo Curino added a comment -

          Karthik, you are correct. Glad you like the idea, and you ask good questions...

          This could be relevant for lowering the load on the central RM (and hence help with scale), in particular if we have a vast number of short-lived tasks (heavy scheduling cost for little work).
          (However, we have other ongoing work toward that, which we will post soon; hence the focus here on utilization.)

          What takes care of the "fast adaptation" to node conditions is having a local queue (from which the node can pick more work when it is idle), and the notion of different container types (i.e., the node can kick out the optimistic containers when it is overbooked).
          With this in mind, the RM could be the one making scheduling decisions for queueable/optimistic containers as well, as you pointed out.
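          A minimal sketch of that "kick out optimistic containers when overbooked" step, under the simplifying assumption of a single memory dimension; the class names are illustrative and this is not YARN code.

            import java.util.ArrayList;
            import java.util.List;

            final class OverbookingResolver {

              // Simplified view of a running container: single memory dimension only.
              static final class RunningContainer {
                final String id;
                final int memoryMb;
                final boolean optimistic;

                RunningContainer(String id, int memoryMb, boolean optimistic) {
                  this.id = id;
                  this.memoryMb = memoryMb;
                  this.optimistic = optimistic;
                }
              }

              // Pick optimistic containers to kill so that an incoming conservative
              // container of the given size fits on the node. If the victims do not
              // cover the deficit, the caller has to queue or reject the request.
              static List<RunningContainer> selectVictims(List<RunningContainer> running,
                                                          int nodeCapacityMb,
                                                          int conservativeDemandMb) {
                int used = running.stream().mapToInt(c -> c.memoryMb).sum();
                int deficit = used + conservativeDemandMb - nodeCapacityMb;
                List<RunningContainer> victims = new ArrayList<>();
                for (RunningContainer c : running) {
                  if (deficit <= 0) {
                    break;
                  }
                  if (c.optimistic) {  // conservative containers are never preempted
                    victims.add(c);
                    deficit -= c.memoryMb;
                  }
                }
                return victims;
              }
            }

          Which optimistic containers to pick first (e.g., the most recently started, or only those whose minimum-runtime guarantee has elapsed) is a policy question in its own right.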

          What is constant (whether you make the scheduling decisions centrally or in a distributed fashion) is the notion of different container types (see YARN-2882).
          This should be exposed to the AM, as each type comes with a very different level of guarantees on container start/completion.
          Thus the AM needs to know which type of containers to use for different tasks (e.g., short-lived or non-critical-path containers can be optimistic).

          kasha Karthik Kambatla added a comment -

          +1 to the idea, particularly to reduce the allocation latency. I definitely see Impala wanting to use this in the future. Though it is not mentioned in the description, I believe scale is probably another big reason for distributed scheduling.

          "Improve cluster utilization by opportunistically executing tasks on otherwise idle resources on individual machines."

          Couldn't a centralized RM schedule tasks opportunistically too? Is the intention to adapt quickly to changing resource usage on the node, the NM-RM-NM round trip being too slow to catch this window of opportunity?

          sriramsrao Sriram Rao added a comment -

          The proposal:

          1. Extend the NM to support task queueing. AMs can queue tasks directly at the NMs, and the NMs will execute those tasks opportunistically.
          2. Extend the types of containers that YARN exposes:
            • CONSERVATIVE: This corresponds to containers allocated by YARN today.
            • OPTIMISTIC: This corresponds to a new class of containers, which will be queued for execution at the NM.
            This extension allows AMs to control what type of container they are requesting from the RM framework.
          3. Extend the NM with a "local RM" (i.e., a local Resource Manager), which uses local policies to decide when an OPTIMISTIC container can be executed.

          We are exploring timed leases for OPTIMISTIC containers to ensure a minimum duration of execution. At the same time, this mechanism allows NMs to free up resources and thus guarantee predictable start times for CONSERVATIVE containers.
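          As a minimal sketch of how such a timed lease could gate preemption: the lease object below is hypothetical bookkeeping the NM could keep per OPTIMISTIC container; none of these names come from the proposal itself.

            import java.time.Duration;
            import java.time.Instant;

            // Hypothetical per-container lease: an OPTIMISTIC container may only be
            // preempted (e.g., to admit a CONSERVATIVE container) once its lease has
            // expired, which gives it a guaranteed minimum execution time.
            final class OptimisticLease {
              private final Instant startedAt;
              private final Duration minimumRuntime;

              OptimisticLease(Instant startedAt, Duration minimumRuntime) {
                this.startedAt = startedAt;
                this.minimumRuntime = minimumRuntime;
              }

              boolean preemptable(Instant now) {
                return !now.isBefore(startedAt.plus(minimumRuntime));
              }
            }

          The actual lease duration, and whether expired containers are killed immediately or only when a CONSERVATIVE request arrives, are open policy questions here.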

          There are additional motivations for the use of this feature, and we will discuss them in follow-up comments.


            People

            • Assignee: kkaranasos Konstantinos Karanasos
            • Reporter: sriramsrao Sriram Rao
            • Votes: 0
            • Watchers: 83
