  Hadoop YARN > YARN-397 RM Scheduler api enhancements > YARN-1042

add ability to specify affinity/anti-affinity in container requests

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      Container requests made by the AM should be able to request anti-affinity, to ensure that things like Region Servers don't come up in the same failure zones.

      Similarly, you may want to be able to specify affinity to the same host or rack without naming a specific host/rack. Example: bringing up a small Giraph cluster in a large YARN cluster would benefit from having the processes in the same rack purely for bandwidth reasons.

      1. YARN-1042.001.patch
        54 kB
        Weiwei Yang
      2. YARN-1042.002.patch
        57 kB
        Weiwei Yang
      3. YARN-1042-demo.patch
        15 kB
        Junping Du
      4. YARN-1042-design-doc.pdf
        68 kB
        Weiwei Yang
      5. YARN-1042-global-scheduling.poc.1.patch
        60 kB
        Wangda Tan


          Activity

          kkaranasos Konstantinos Karanasos added a comment -

          Wangda Tan, I think you are right that exposing min/max cardinality might be complicated for the user...

          So, you are suggesting to expose to the user explicit affinity and anti-affinity constraints, but interpret them under the hood as more complex unified constraints?
          I think it is a good approach. This way we keep it simple for the end user, and we create the hooks for more complicated cardinality and tag constraints as future work.

          leftnoteasy Wangda Tan added a comment -

          Thanks Konstantinos Karanasos for these suggestions.

          I agree we should merge the scheduler implementation so it supports both affinity and anti-affinity; we should not have duplicated code for the two.

          However, personally I'm not in favor of min-cardinality / max-cardinality, for two reasons:
          1) Compared to affinity-to / anti-affinity-to, it is not straightforward enough for a typical YARN user (of course, it is easy for a PhD).
          2) We have to trade off syntax flexibility against feature availability. Users will be happy if YARN allows them to specify as many constraints as they want, but they will soon be frustrated if we cannot deliver on them in time. For example, a user can ask to allocate min=10/max=10 MPI tasks on each host, but in reality that could be quite hard to satisfy in a busy cluster, and it will be hard for YARN to optimize such constraints (for example, how to preempt containers to satisfy a min=max=10 cardinality).

          As a first step, I have a simpler suggestion that handles the most common use cases:

          • Extend existing ResourceRequest, by adding a new placementConstraintExpression
          • Simple anti-affinity (min=max=1)
          • Simple affinity (min=2,max=infinity)
          • Within app and between app
          • Once tags are allowed to be specified in ResourceRequest, we can support (anti-)affinity for tags

          The API could look like:

          • affinity-to app=app_1234_0002
          • anti-affinity-to app=app_1234_0002
          • anti-affinity-to tag="Hbase-master"
          • anti-affinity-to app=app_1234_0002,tag="HBase-master"

          Richer syntax (like affinity to a rack / the cluster, etc.) is something we can think about more when doing YARN-4902.
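
          For illustration only, a rough sketch of how such expression strings could be built on the client side. The helper class below is hypothetical; nothing in it is an existing YARN API, it only encodes the proposed syntax:

            // Hypothetical helper that builds the proposed placement constraint expressions.
            public final class PlacementConstraintExpressions {

              public static String affinityTo(String appId) {
                return "affinity-to app=" + appId;
              }

              public static String antiAffinityTo(String appId, String tag) {
                StringBuilder sb = new StringBuilder("anti-affinity-to ");
                if (appId != null) {
                  sb.append("app=").append(appId);
                }
                if (tag != null) {
                  if (appId != null) {
                    sb.append(',');
                  }
                  sb.append("tag=\"").append(tag).append('"');
                }
                return sb.toString();
              }

              public static void main(String[] args) {
                // Prints: anti-affinity-to app=app_1234_0002,tag="HBase-master"
                System.out.println(antiAffinityTo("app_1234_0002", "HBase-master"));
              }
            }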

          kkaranasos Konstantinos Karanasos added a comment -

          Hi Wangda Tan. Thanks for working on this. Being able to specify (anti-)affinity constraints is a feature that we are very interested in as well.
          I gave a quick look at your patch. As I mentioned to you in the past, we also have a working prototype for defining and imposing similar constraints (I had uploaded an initial version of it at YARN-4902 some time back). We can share more code if it helps.

          For now let's focus on the API – we can later discuss how to make the scheduling efficient (e.g., using the global scheduling and so on).

          I would suggest unifying the affinity and anti-affinity constraints and making them a bit more general too. For instance, we could specify "cardinality" constraints of the form "put no more than 4 HBase masters on each rack". This constraint is useful but cannot be captured by pure affinity or anti-affinity constraints.
          I assume that each task can be associated with a set of labels (similar to the ones we are discussing in YARN-3409). This is not there at the moment, but can be added.
          Then, we could use the following placement constraint form:

           {task, target-tasks, cluster-scope, min-cardinality, max-cardinality}. 

          In the above constraint, task is a specific task; target-tasks is a label (e.g., "HBase-master", "app_000123" or "latency-critical") that specifies a set of tasks already scheduled in the cluster; cluster-scope is one of NODE and RACK (and possibly others in the future); min-cardinality is the minimum number of tasks with label target-tasks that can appear in the same cluster-scope as task; and max-cardinality is the respective maximum cardinality.
          Using the min- and max-cardinality, we can specify affinity, anti-affinity and other cardinality constraints.

          For example,

          {task_001, "HBase-master", NODE, 1, 1}

          is an anti-affinity constraint (don't put more than one HBase master on a node).
          Likewise,

          {task_001, "HBase-region-server", NODE, 2, MAX_INT}

          is an affinity constraint for region servers.
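
          For illustration, a minimal sketch of how such a constraint tuple could be modeled; all names below are hypothetical, not existing YARN classes:

            // Hypothetical value class for the proposed
            // {task, target-tasks, cluster-scope, min-cardinality, max-cardinality} tuple.
            public final class PlacementConstraint {

              public enum Scope { NODE, RACK }

              private final String taskId;      // the task being placed, e.g. "task_001"
              private final String targetTag;   // label of already-scheduled tasks, e.g. "HBase-master"
              private final Scope scope;
              private final int minCardinality;
              private final int maxCardinality;

              public PlacementConstraint(String taskId, String targetTag, Scope scope,
                  int minCardinality, int maxCardinality) {
                this.taskId = taskId;
                this.targetTag = targetTag;
                this.scope = scope;
                this.minCardinality = minCardinality;
                this.maxCardinality = maxCardinality;
              }

              // {task_001, "HBase-master", NODE, 1, 1}: anti-affinity for HBase masters.
              public static PlacementConstraint antiAffinity(String taskId, String targetTag) {
                return new PlacementConstraint(taskId, targetTag, Scope.NODE, 1, 1);
              }

              // {task_001, "HBase-region-server", NODE, 2, MAX_INT}: affinity for region servers.
              public static PlacementConstraint affinity(String taskId, String targetTag) {
                return new PlacementConstraint(taskId, targetTag, Scope.NODE, 2, Integer.MAX_VALUE);
              }
            }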

          Let me know how it looks.

          leftnoteasy Wangda Tan added a comment -


          Opened YARN-5907 as the umbrella JIRA for this ticket.

          leftnoteasy Wangda Tan added a comment -

          I also attached a POC patch on top of the latest YARN-5139.

          Here's the anti-affinity / affinity talk and demo that Varun Vasudev and I gave at Hadoop Summit this year: https://www.youtube.com/watch?v=1M5bEwHj5Wc (starts around 10 min 30 sec).

          Any feedback is welcome!

          leftnoteasy Wangda Tan added a comment -

          Hi Weiwei Yang, thanks for the design doc, patch and ideas.

          I think there are two major problems that we need to solve:
          1) What should the API look like?
          2) How do we make it efficient enough?

          For 1), I agree with most of the comments from chong chen: the existing ResourceRequest API has some flaws, and we need to think about how to fix them. We have a design doc attached to YARN-4902 with a lot of thoughts on this; please read it if you are interested.
          As for the affinity/anti-affinity API itself, we may need to consider affinity/anti-affinity between applications/priorities. Sometimes several services are launched by a single YARN application, and different roles from different services may have different affinity/anti-affinity requirements. One service/job may also need affinity/anti-affinity to another service/job.

          For 2), the existing scheduler blindly loops over all nodes in the cluster; we need a global view to schedule for better performance. I have opened YARN-5139 for the global scheduling framework and attached a POC patch there.

          Currently I'm working on a prototype patch for this; I want to base this feature on YARN-5139 and probably on YARN-4902.

          I'm interested in taking this forward, assigning it to myself.

          Would like to hear your thoughts.

          cheersyang Weiwei Yang added a comment -

          Hello chong chen

          Thanks for your thoughts.

          Basically, one application/attempt may include multiple groups of container requests, and each group includes multiple container requests.

          This sounds like the idea of supporting a container allocation rule per request, e.g. app1 asks for 3 containers with an affinity policy and then asks for 4 containers with an anti-affinity policy. That requires an AM-RM protocol change, which is why it is not in my patch yet. What I have done is let you specify a rule per app.

          Fundamentally, it is hard for the scheduler to make the right judgement without knowing the raw container requests. The situation gets worse when dealing with affinity and anti-affinity, or even gang scheduling.

          I do not fully understand what "raw container request" means in your comments, but I think I understand your point.
          While implementing container allocation policies, the hard part for me is that the scheduler is not aware of the context when it tries to allocate a container. Ideally, it needs to know which containers this container is related to (they are not independent, like the group you mentioned), and it also needs scheduling details such as how long a request has been waiting and how many requests are waiting. This information would be very helpful for the scheduler to make more complex decisions.

          cchen317 chong chen added a comment -

          A few comments and thoughts:

          1) YARN Application Model

          The current YARN job model has only two layers of requests: application/application attempt and container request. Container requests are independent of each other during the scheduling phase. This model works fine for the existing simple scenarios. However, when dealing with gang scheduling or affinity/anti-affinity, it is not enough. Essentially, all of these features introduce relationships among container requests. To express such relationships easily, it seems natural to have a group concept: one application/attempt may include multiple groups of container requests, and each group includes multiple container requests.

          There will be policies within a group. For instance, for group1 the application prefers an anti-affinity policy over hosts, while for group2 it would be nice to have an affinity policy within a rack. Furthermore, with the group concept you can also express that for group3 the scheduler must make sure all container requests are satisfied before making the allocation decision, or, more advanced, that group4 needs at least 4 container requests satisfied at the same time before the allocation decision is made. There can also be policies among groups. For instance, if group1 represents HBase masters and group2 represents region servers, I may not want the two groups sharing the same host.

          The bottom line is that containers can no longer be considered independently; there are cases where the scheduler needs to consider them together to make the optimal placement decision.

          2) Container Request and Scheduling Flaw

          To handle affinity/gang scheduling properly, this logic should live in the scheduling path and consider all container requests in a group as a whole. The logic should be in the scheduler rather than the RM layer, and the scheduler needs to know the individual container requests within each group so it can make proper scheduling decisions. This leads to another potential design issue in current YARN.

          Currently, when the AM sends container requests to the RM and scheduler, it expands individual container requests into host/rack/any form. For instance, if I ask for a container with the preference "host1, host2, host3", all in the same rack rack1, then instead of sending one raw container request with the raw preference list to the RM/scheduler, the client expands it into 5 different objects: host1, host2, host3, rack1 and any. By the time the scheduler receives the information, the raw request is already lost. This is fine for a single container request, but it causes trouble when dealing with multiple container requests from the same application. Consider this case:

          6 hosts, two racks:

          rack1 (host1, host2, host3) rack2 (host4, host5, host6)

          When the application requests two containers with different data locality preferences:

          c1: host1, host2, host4
          c2: host2, host3, host5

          This ends up as the following container request list when the client sends the request to the RM/scheduler:

          host1: 1 instance
          host2: 2 instances
          host3: 1 instance
          host4: 1 instance
          host5: 1 instance
          rack1: 2 instances
          rack2: 2 instances
          any: 2 instances

          During scheduling, suppose the host1 heartbeat arrives and the scheduler assigns a container on host1. Before the ApplicationMaster receives the container and updates its requests, if the next heartbeat is from host4, the scheduler will assign another container there, even though both assignments belong to the same container request (c1).

          Fundamentally, it is hard for the scheduler to make the right judgement without knowing the raw container requests. The situation gets worse when dealing with affinity and anti-affinity, or even gang scheduling.

          Ideally, the YARN resource allocation request should be changed to send the raw container requests instead of the expanded ones. It should be the scheduler module's responsibility to interpret the raw requests and build up optimized data structures to use them.
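
          To make the loss of information concrete, a small illustrative program (not the actual AMRMClient code) that mimics the expansion of the c1/c2 preference lists above into per-location counts:

            import java.util.Arrays;
            import java.util.LinkedHashMap;
            import java.util.LinkedHashSet;
            import java.util.List;
            import java.util.Map;
            import java.util.Set;

            // Illustrative only: shows how per-container host preferences collapse into
            // independent host/rack/any counts, so the scheduler no longer knows which
            // container wanted which host.
            public final class RequestExpansionDemo {
              public static void main(String[] args) {
                Map<String, List<String>> rawRequests = new LinkedHashMap<>();
                rawRequests.put("c1", Arrays.asList("host1", "host2", "host4"));
                rawRequests.put("c2", Arrays.asList("host2", "host3", "host5"));

                Map<String, String> hostToRack = new LinkedHashMap<>();
                hostToRack.put("host1", "rack1"); hostToRack.put("host2", "rack1");
                hostToRack.put("host3", "rack1"); hostToRack.put("host4", "rack2");
                hostToRack.put("host5", "rack2"); hostToRack.put("host6", "rack2");

                Map<String, Integer> expanded = new LinkedHashMap<>();
                for (List<String> hosts : rawRequests.values()) {
                  for (String host : hosts) {
                    expanded.merge(host, 1, Integer::sum);   // host-level entries
                  }
                  Set<String> racks = new LinkedHashSet<>();
                  for (String host : hosts) {
                    racks.add(hostToRack.get(host));
                  }
                  for (String rack : racks) {
                    expanded.merge(rack, 1, Integer::sum);   // rack-level entries
                  }
                  expanded.merge("any", 1, Integer::sum);    // ANY entry
                }
                // Prints host1=1, host2=2, host3=1, host4=1, host5=1, rack1=2, rack2=2, any=2.
                expanded.forEach((location, count) -> System.out.println(location + ": " + count));
              }
            }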

          cheersyang Weiwei Yang added a comment -

          Hi Steve

          From your comments, I think you want the YARN API to support requesting containers with a given rule per request. I guess you want the API to look like this:
          in AMRMClient.java, ContainerRequest would support the following arguments

          • Resource capability
          • String[] nodes
          • String[] racks
          • Priority priority
          • boolean relaxLocality
          • String nodeLabelExpression
          • ContainerAllocateRule containerAllocateRule

          The last one is the new argument for specifying a particular rule (we will discuss rule details later). The problem here is that the RM needs to know which application (or role, in Slider's context) the rule applies to; only then can the RM assign containers obeying that rule when dealing with requests coming from the same application. However, if you only specify the rule per container request, how can the RM know which containers need to be considered when applying the rule? Let me give an example to explain:

           ContainerRequest containerReq1 = new ContainerRequest(capability1, nodes, racks, priority, affinityRequiredRule);
           amClient.addContainerRequest(containerReq1);
           AllocateResponse allocResponse = amClient.allocate(0.1f);
          

          The AllocateRequest the AM sends to the RM only tells the RM that these container requests need to use affinityRequiredRule, but the RM does not know which containers this request should be affine with, so the RM cannot apply the rule during allocation. This is the reason I propose registering the mapping

          application <-> allocation-rule
          

          when the client submits the application, and keeping it in the RM context, so the RM can apply the rule when a request comes in from the AM.
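
          A minimal sketch of the kind of application-to-rule registry described above, kept in the RM context; the class and method names are hypothetical:

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Hypothetical RM-side registry: one container allocation rule per application.
            public final class AllocationRuleRegistry {

              public enum ContainerAllocateRule { AFFINITY_REQUIRED, ANTI_AFFINITY_REQUIRED, NONE }

              private final Map<String, ContainerAllocateRule> rulesByAppId =
                  new ConcurrentHashMap<>();

              // Recorded when the client submits the application with a rule attached.
              public void register(String applicationId, ContainerAllocateRule rule) {
                rulesByAppId.put(applicationId, rule);
              }

              // Looked up by the scheduler when a request arrives from that application's AM.
              public ContainerAllocateRule ruleFor(String applicationId) {
                return rulesByAppId.getOrDefault(applicationId, ContainerAllocateRule.NONE);
              }
            }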

          cheersyang Weiwei Yang added a comment -

          Thanks Steve, I have updated my patch based on your comments 2 and 3. For comment 1, we need some more discussion. I appreciate all the suggestions.

          stevel@apache.org Steve Loughran added a comment -

          A bit more detail on how Slider uses placement: specifically, we have different policies for different container roles.

          We have the notion of different component types (region servers, hbase masters, kafka nodes), with different placement policies related to affinity/anti-affinity and whether we use the history of previous locations.

          Examples

          • HBase masters: don't care where they come up, as long as they have anti-affinity (that's a goal)
          • region servers: ideally, anti-affinity + history. We remember where they were last run and ask for one RS on every host that had one before (if there were 2 region servers on a host, only one is asked for, for better distribution)
          • Kafka nodes: ask for the previous locations and wait 20+ minutes for YARN to satisfy the request, because a Kafka node is so expensive to rebuild

          This gives us the following policies:

          • anywhere
          • anywhere but use history
          • anti-affinity-preferred
          • anti-affinity required (not yet implemented)
          • strict: wherever used before is where needed next. There's no attempt to escalate placement if the request is unsatisfied.

          We assign one role per YARN request priority, which is how we map a granted container back to the requested role. This implies that we have one placement policy per request priority. I'd be happy if YARN supported that, rather than allowing us to specify a policy per request, which would make requests un-aggregatable and hence unscalable.
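
          For illustration, a tiny sketch of the one-policy-per-request-priority mapping described above (hypothetical classes, roughly how an AM such as Slider could track it internally):

            import java.util.HashMap;
            import java.util.Map;

            // Hypothetical: one role, and hence one placement policy, per request priority.
            public final class RolePlacementPolicies {

              public enum PlacementPolicy {
                ANYWHERE, ANYWHERE_WITH_HISTORY, ANTI_AFFINITY_PREFERRED,
                ANTI_AFFINITY_REQUIRED, STRICT
              }

              private final Map<Integer, PlacementPolicy> policyByPriority = new HashMap<>();

              public void registerRole(int requestPriority, PlacementPolicy policy) {
                policyByPriority.put(requestPriority, policy);
              }

              // A granted container carries its request priority, which identifies the role/policy.
              public PlacementPolicy policyForGrantedPriority(int grantedPriority) {
                return policyByPriority.get(grantedPriority);
              }
            }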

          stevel@apache.org Steve Loughran added a comment -

          Thanks for this. I'm afraid I don't know enough about YARN scheduling to review it properly; hopefully the experts will.

          meanwhile,

          1. Does this let us specify different rules for different container requests? The uses we see in Slider do differentiate between requests (e.g. we care more about affinity of HBase masters than we do about workers).
          2. How would we set up the enum if we ever want to add different placement policies in the future? I think the strategy would be to have a "no-preferences" policy, which would be the default and different from affinity (== I really want them on the same node) and anti-affinity. Then we could have a switch statement to choose placement, rather than just an isAffinity() predicate.
          3. Hadoop's code style rules are "two spaces, no tabs".
          cheersyang Weiwei Yang added a comment -

          I have worked out a patch. It is not complete, but the part I've done works the way I expected. I tested it on my 5-node cluster with AFFINITY and ANTI_AFFINITY in NODE scope. Please help review the patch; comments and suggestions are appreciated.

          Work left to be done:
          1. Complete container allocate handlers for RACK scope
          2. Write test cases to test rules in RACK scope
          3. Add support for the maxTimeAwaitBeforeCompromise argument
          4. Complete code changes on other schedulers

          cheersyang Weiwei Yang added a comment -

          Hi Steve

          Thanks for the comments. I have corrected the wording. And regarding:

          That way the AM can choose to wait 1 minute or more for an anti-affine placement before giving up and accepting a node already in use.

          This is exactly the reason I proposed the PREFERRED rules: you can set a maximum wait time before compromising the rule. For example, if you use the ANTI_AFFINITY rule and set the max wait time to 1 minute, the RM will wait at least 1 minute before assigning a container to a node which already has a container running on it. Or, forgetting about REQUIRED vs PREFERRED, we could define this preference directly in the ContainerAllocateRule class, with an attribute like maxTimeAwaitBeforeCompromise that defaults to 0, meaning never compromise the rule (i.e. REQUIRED).
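
          A minimal sketch of that idea, assuming a hypothetical ContainerAllocateRule class (not the exact shape of the patch):

            // Hypothetical: an allocation rule that may be compromised after a maximum wait.
            public final class ContainerAllocateRule {

              public enum Type { AFFINITY, ANTI_AFFINITY }

              private final Type type;
              // 0 means never compromise the rule, i.e. REQUIRED semantics.
              private final long maxTimeAwaitBeforeCompromiseMs;

              public ContainerAllocateRule(Type type, long maxTimeAwaitBeforeCompromiseMs) {
                this.type = type;
                this.maxTimeAwaitBeforeCompromiseMs = maxTimeAwaitBeforeCompromiseMs;
              }

              public Type getType() {
                return type;
              }

              // The scheduler would call this with how long the request has been pending.
              public boolean mayCompromise(long waitedMs) {
                return maxTimeAwaitBeforeCompromiseMs > 0
                    && waitedMs >= maxTimeAwaitBeforeCompromiseMs;
              }
            }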

          stevel@apache.org Steve Loughran added a comment -

          I like this, though I'd also like PREFERRED to have two Rs in the middle.

          Thinking about how I'd use this in Slider, I'd probably want to keep the escalation logic, i.e. deciding when to accept shared-node placement, in my own code. That way the AM can choose to wait 1 minute or more for an anti-affine placement before giving up and accepting a node already in use. We already do that when asking for a container back on the host where an instance ran previously.

          cheersyang Weiwei Yang added a comment -

          I am thinking about the following approach; suggestions are appreciated :)

          In the ApplicationSubmissionContext class, add a new argument to indicate the container allocation rule in terms of affinity/anti-affinity. The RM will follow that rule when allocating containers for the application. The argument is an instance of a new class, ContainerAllocationRule, which defines several types of allocation rule, such as

          • AFFINITY_REQUIRED: containers MUST be allocated on the same host/rack
          • AFFINITY_PREFERED: prefer to allocate containers on same host/rack if possible
          • ANTI_AFFINITY_REQUIRED: containers MUST be allocated on different hosts/racks
          • ANTI_AFFINITY_PREFERED: prefer to allocate containers on different hosts/racks if possible

          Each of these rules will have a handler on the RM side to add some control over container allocation. When a client submits an application with a certain ContainerAllocationRule to the RM, this information will be attached to the ApplicationAttemptId (because the allocation rule is defined per application). When the RM uses the registered scheduler to allocate containers, it can retrieve the rule from the ApplicationAttemptId and call the particular handler during allocation. The code can be added to SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens so as to avoid modifying every scheduler.
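
          As a sketch of the per-rule handler dispatch described above (hypothetical code, not taken from the patch):

            // Hypothetical sketch of how the handlers could gate a candidate node.
            public final class AllocationRuleHandlers {

              // Spelling of PREFERED follows the proposal above.
              public enum ContainerAllocationRule {
                AFFINITY_REQUIRED, AFFINITY_PREFERED,
                ANTI_AFFINITY_REQUIRED, ANTI_AFFINITY_PREFERED
              }

              // Decide whether the candidate node respects the application's rule, given how
              // many of the application's containers already run on it and whether the
              // application already has containers running anywhere in the cluster.
              static boolean allows(ContainerAllocationRule rule, boolean appHasRunningContainers,
                  int containersOnThisNode, boolean canStillWait) {
                switch (rule) {
                  case ANTI_AFFINITY_REQUIRED:
                    return containersOnThisNode == 0;
                  case ANTI_AFFINITY_PREFERED:
                    // Prefer an empty node, but compromise once we cannot wait any longer.
                    return containersOnThisNode == 0 || !canStillWait;
                  case AFFINITY_REQUIRED:
                    // The first container can go anywhere; later ones must join existing ones.
                    return !appHasRunningContainers || containersOnThisNode > 0;
                  case AFFINITY_PREFERED:
                    return !appHasRunningContainers || containersOnThisNode > 0 || !canStillWait;
                  default:
                    return true;
                }
              }
            }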

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I have a proposal document that I am going to post shortly.

          But the shorter answer is that the YARN effort is complex and will take a while. SLIDER-82 could take a shortcut by implementing it in the AM; it won't be fool-proof without RM support, but it should get you off the ground.

          stevel@apache.org Steve Loughran added a comment -

          I now think that SLIDER-82 could be done without this; we just use the blacklisting feature of the YARN request API to blacklist all nodes on which we have a running instance, expand the cluster gradually, and release those nodes where there's an affinity conflict.

          This doesn't mean that it wouldn't be convenient to have this in the YARN scheduler, only that I think we can get by without it.
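
          A rough sketch of that workaround on the AM side. It assumes AMRMClient#updateBlacklist, which exists in the YARN client library; the surrounding bookkeeping is hypothetical:

            import java.util.ArrayList;
            import java.util.Collection;
            import java.util.List;

            import org.apache.hadoop.yarn.api.records.Container;
            import org.apache.hadoop.yarn.client.api.AMRMClient;

            // Illustrative only: approximate anti-affinity from the AM by blacklisting every
            // node that already hosts one of our running instances before asking for more.
            public final class AntiAffinityByBlacklist {

              static void blacklistOccupiedNodes(AMRMClient<AMRMClient.ContainerRequest> amClient,
                  Collection<Container> runningContainers) {
                List<String> additions = new ArrayList<>();
                for (Container c : runningContainers) {
                  additions.add(c.getNodeId().getHost());
                }
                // Nodes with a running instance are excluded from subsequent allocations.
                amClient.updateBlacklist(additions, new ArrayList<String>());
              }
            }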

          cheersyang Weiwei Yang added a comment -

          Sorry, not an alternative: SLIDER-82 depends on this JIRA.

          cheersyang Weiwei Yang added a comment -

          There has been no update here for a long time; anything new? This looks like a nice feature that would help us a lot. Is this the right direction, and should we put more effort into getting this done on the RM side? I noticed there are alternatives in the Slider project, e.g. SLIDER-82.
          Please advise.

          yanghaogn Yang Hao added a comment -

          A configuration option should be added to yarn-site.xml.

          djp Junping Du added a comment -

          Sure. Arun, thanks for working on this. Please go ahead!

          acmurthy Arun C Murthy added a comment -

          Junping Du, do you mind if I take this over? I can do this concurrently with YARN-796 (for which I already have a patch). Tx!

          djp Junping Du added a comment -

          Do you mean YARN-624? I haven't followed that JIRA before, and it looks like it has been quiet for a while. Do you know its status? Shall we add a dependency on that JIRA?

          stevel@apache.org Steve Loughran added a comment -

          I envisage it as RM-side. Once gang scheduling is in, you could then request of the AM that you get three containers on separate nodes, or (after a time period) fail with an error.

          djp Junping Du added a comment -

          Thanks for the comments, Luke!

          Although you can do a lot on the app side with container filtering, protocol and scheduler support will make it more efficient. I guess the intention of the JIRA is more for the latter: affinity support should be app-independent.

          Oh, that reminds me that the intention of this JIRA may be on the RM side (so the title might be changed from "container requests" to "resource requests"), since long-lived services may have a different AppMaster from the default one that I changed here. Also, I agree that doing it on the RM side may be more efficient, since there is no need to return containers on the app side for violating affinity rules.
          However, my concern is that it may add extra complexity to the RM, since it makes the RM aware of the affinity/anti-affinity groups of tasks (or resource requests). IMO, one simplicity and beauty of YARN is that the RM only takes care of abstracted resource requests and does container allocation accordingly. I am not sure whether putting resource requests into affinity/anti-affinity groups and tracking resource request relationships hurts this beauty. Thoughts?

          vicaya Luke Lu added a comment -

          BTW, it seems the effort is more on the application side. Do we think it is better to move this to the MAPREDUCE project?

          Although you can do a lot on the app side with container filtering, protocol and scheduler support will make it more efficient. I guess the intention of the JIRA is more for the latter: affinity support should be app-independent.

          djp Junping Du added a comment -

          Hi Steve Loughran, as you are the creator of this JIRA and will probably consume this API in the HOYA project, it would be great if you could provide some input here. Thx!

          djp Junping Du added a comment -

          BTW, it seems the effort is more on the application side. Do we think it is better to move this to the MAPREDUCE project?

          djp Junping Du added a comment -

          Attaching a patch to showcase the above proposal. This is only a demo patch and doesn't include any unit tests so far.
          I think there are several open questions here before we move on to the next step:

          1. Is this affinity/anti-affinity rule bidirectional or not? If A.affinity(B) is true, is B.affinity(A) always true? I guess not, since A may prefer a list of nodes, which makes the relationship non-symmetric. Also, that is how we can distinguish "A prefers to live with B and C" from "A prefers to live with B or C". Isn't it?

          2. Which rule has higher priority when an affinity rule conflicts with an anti-affinity rule? In the demo patch, the affinity rule takes higher priority, but I am not sure this is right in real cases. Do we want to make it configurable? Or do we just make sure that a rule updated later overrides a previous one if they conflict?

          3. Currently, affinity/anti-affinity is only considered at the node level; do we want to expand it to other levels, e.g. the rack level, in the future?

          4. The API currently adds a list of task IDs as affinity/anti-affinity tasks. Is that easy to consume from the application's perspective?

          5. The affinity/anti-affinity rules are hard constraints in the current implementation, which may cause tasks to starve for a long time; should we think about more relaxed rules?

          Comments are welcome. Thx!

          djp Junping Du added a comment -

          My initial proposal is: add affinity and anti-affinity lists of TaskAttemptId to ContainerRequest; then, when assigning map tasks to allocated containers, take the affinity/anti-affinity relationships into consideration by looking at assignedRequests (tid -> container -> nodeID). Any comments?
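
          For illustration, a rough sketch of that check; the types below stand in for the MR AM's TaskAttemptId/NodeId and the assignedRequests view, and are hypothetical:

            import java.util.List;
            import java.util.Map;

            // Hypothetical sketch: is a candidate node acceptable for a task, given its
            // affinity / anti-affinity lists and the current tid -> nodeId assignments?
            public final class AffinityCheck {

              static boolean nodeAcceptable(String candidateNode,
                  List<String> affinityTaskIds, List<String> antiAffinityTaskIds,
                  Map<String, String> taskToNode) {
                for (String tid : antiAffinityTaskIds) {
                  if (candidateNode.equals(taskToNode.get(tid))) {
                    return false;   // an anti-affine task already runs on this node
                  }
                }
                for (String tid : affinityTaskIds) {
                  String node = taskToNode.get(tid);
                  if (node != null && !node.equals(candidateNode)) {
                    return false;   // an affine task is already pinned to a different node
                  }
                }
                return true;
              }
            }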

          djp Junping Du added a comment -

          Agree. IMO, YARN-796 just adds more admin labels for each node, whereas failure group info is already expressed in the network topology. Here we need a stateful allocation for a group of containers rather than assigning each container statelessly, and we need a way to express the required topology relationships between containers.

          stevel@apache.org Steve Loughran added a comment -

          Maybe, but now that Hadoop topologies can express failure domains, we could pick that up to say "spread these containers across more than one failure domain" without having to do any cluster-specific tuning.

          tucu00 Alejandro Abdelnur added a comment -

          It seems much of this could be achieved via YARN-796.

          djp Junping Du added a comment -

          Yes. It is pretty useful in the cases specified in the description. With this affinity and anti-affinity info, the AM should have the knowledge to translate container requests into resource requests and ask the RM.

          stevel@apache.org Steve Loughran added a comment -

          Flagging YARN-896 as needing this.


            People

            • Assignee: leftnoteasy Wangda Tan
            • Reporter: stevel@apache.org Steve Loughran
            • Votes: 5
            • Watchers: 59