Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture, etc.

      We should expose these labels and allow applications to specify labels on resource-requests.

      Obviously we need to support admin operations on adding/removing node labels.

      1. Non-exclusive-Node-Partition-Design.pdf
        200 kB
        Wangda Tan
      2. YARN-796.node-label.consolidate.14.patch
        626 kB
        Wangda Tan
      3. YARN-796.node-label.consolidate.13.patch
        626 kB
        Wangda Tan
      4. YARN-796.node-label.consolidate.12.patch
        609 kB
        Wangda Tan
      5. YARN-796.node-label.consolidate.11.patch
        621 kB
        Wangda Tan
      6. YARN-796.node-label.consolidate.10.patch
        613 kB
        Wangda Tan
      7. YARN-796.node-label.consolidate.8.patch
        547 kB
        Wangda Tan
      8. YARN-796.node-label.consolidate.7.patch
        533 kB
        Wangda Tan
      9. YARN-796.node-label.consolidate.6.patch
        519 kB
        Wangda Tan
      10. YARN-796.node-label.consolidate.5.patch
        506 kB
        Wangda Tan
      11. YARN-796.node-label.consolidate.4.patch
        506 kB
        Wangda Tan
      12. YARN-796.node-label.consolidate.3.patch
        496 kB
        Wangda Tan
      13. YARN-796.node-label.consolidate.2.patch
        475 kB
        Wangda Tan
      14. YARN-796-Diagram.pdf
        54 kB
        Wangda Tan
      15. YARN-796.node-label.consolidate.1.patch
        445 kB
        Wangda Tan
      16. YARN-796.node-label.demo.patch.1
        342 kB
        Wangda Tan
      17. Node-labels-Requirements-Design-doc-V2.pdf
        192 kB
        Wangda Tan
      18. YARN-796.patch4
        205 kB
        Yuliya Feldman
      19. Node-labels-Requirements-Design-doc-V1.pdf
        189 kB
        Wangda Tan
      20. LabelBasedScheduling.pdf
        217 kB
        Yuliya Feldman
      21. YARN-796.patch
        30 kB
        Arun C Murthy

        Issue Links

          Activity

          Alejandro Abdelnur added a comment -

          On admins specifying labels to nodes:

          Are you suggesting statically labeling them via their configs or dynamically via an API? If the latter, what would be the API for that? Where would this labeling happen? In the NM (and the NM propagates it to the RM) or directly in the RM? Would this labeling be persistent? If yes, where would it be stored?

          On apps specifying labels on resource-requests:

          If I set two labels in a resource-request, would it be an OR or an AND?

          Would these tags play any role in the resource-request key in the scheduler? Meaning, what happens if I send 2 consecutive requests for ANY (the latter updating the former) and they have different tags?

          Wondering if we could model this as another resource capability. If we define addition/subtraction methods for a capability, for capabilities that are 'label' we could have infinite capacity in a node. For example, if the intention is to label nodes with a certain architecture, you may not care about the arch capability count for allocation purposes as long as there is CPU and MEM. And still you could use the requested capabilities for billing purposes.

          Having a concrete use case will help in understanding the scope and impact of these changes.

          Arun C Murthy added a comment -

          Yes, we'll need to add an admin API (via rmadmin) to add/remove labels - obviously we can allow configs for nodes to start up with a set of labels which they report during registration.

          Initially, I'm thinking AND is simplest for multiple labels. Consecutive requests, as today, override previous requests.

          The use cases, as I mentioned, are the ability to segregate clusters based on OS, processor architecture, etc. Hence they aren't resource capabilities; rather, they are constraints, which is what labels try to model explicitly.

          Alejandro Abdelnur added a comment -

          Yes, we'll need to add an admin API (via rmadmin) to add/remove labels - obviously we can allow configs for nodes to start up with a set of labels which they report during registration.

          Are labels set via the API persisted in the RM? Where?

          When a node registers, how are labels synced between the ones it had in its config and the ones added/removed via rmadmin?

          Given the use case you are mentioning, it seems these labels are rather static and determined by node characteristics/features. Wouldn't it be simpler to start without an rmadmin API and just get them from the nodes on node registration?

          Initially, I'm thinking AND is simplest for multiple labels. Consecutive requests, as today, override previous requests.

          So the labels would not be part of the resource-request key, right?

          The use cases, as I mentioned, are the ability to segregate clusters based on OS, processor architecture, etc. Hence they aren't resource capabilities; rather, they are constraints, which is what labels try to model explicitly.

          They are expressing capabilities of a node, just capabilities that don't have a quantity that drains. My suggestion for modeling these labels as a resource capability is that you could use them as a dimension in DRF.

          Alejandro Abdelnur added a comment -

          One thing I forgot to mention before is that labels seem to make sense if the resource-request's location is ANY or a rack; for resource-requests that are host specific they do not make sense. We would then have to verify that aspect of a resource-request on arrival at the RM.

          I guess something this would enable is a resource-request like [(location=node1),(location=rack1),(location=ANY,label=wired-to-switch1)]. The topology would be: node1 is in rack1 and rack1 is connected to switch-1. The request says: I prefer node1, then rack1, and then a node in another rack if that rack is connected to switch-1.

          A bit more on modeling them as a resource (I have not thought it through in full, so it's just an idea to consider).

          Labels are resources with max=1 and a total equal to the max number of containers a node may have (driven by mem or cpu, whichever allows fewer containers). Then a node that has a given label-resource always has capability for it if there is enough memory and/or CPU capability. Then, a DRF scheduler could use label-resources for 'fair' allocation decisions.

          Another option, away from the resources modeling, would be to model labels similar to location.

          Said all this, I think this is a great idea that opens a new set of allocation possibilities. We should spend some time defining exactly what functionality we want to achieve and then decide if we do that in incremental phases.

          Arun C Murthy added a comment -

          One thing I forgot to mention before is that labels seem to make sense if the resource-request's location is ANY or a rack; for resource-requests that are host specific they do not make sense.

          Agreed, makes sense. We should probably throw an InvalidResourceRequestException if a user tries to make a host-specific RR with a label.
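
          A minimal, self-contained sketch of the check being proposed here, assuming a request is identified by a resource name that is either "*" (ANY), a rack, or a host; the class, method and exception below are illustrative only, not part of any patch:

            import java.util.Set;

            /** Sketch: labels are only legal on rack-level or ANY ("*") resource requests. */
            final class LabelRequestValidator {
              static final String ANY = "*";

              static void validate(String resourceName, Set<String> requestedLabels,
                                   Set<String> knownRacks) {
                boolean hostSpecific = !ANY.equals(resourceName) && !knownRacks.contains(resourceName);
                if (hostSpecific && !requestedLabels.isEmpty()) {
                  // The thread proposes InvalidResourceRequestException here; a plain
                  // IllegalArgumentException keeps this sketch self-contained.
                  throw new IllegalArgumentException(
                      "Labels are only allowed on rack or ANY resource requests, got: " + resourceName);
                }
              }
            }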

          Arun C Murthy added a comment -

          Labels are resources with max=1 and a total equal to the max number of containers a node may have (driven by mem or cpu, whichever allows fewer containers). Then a node that has a given label-resource always has capability for it if there is enough memory and/or CPU capability. Then, a DRF scheduler could use label-resources for 'fair' allocation decisions.

          A node can have a large number of admin-specified labels (os, arch, dept etc. etc.).

          Resources have to be well specified and tangible (cpu, memory, disk, network).

          We don't want to be recompiling PB defs to add an unknown number of labels.

          Hence, we need to go a separate 'constraint' or 'label' route.

          Steve Loughran added a comment -

          I'd like to be able to allocate different labels to different queues, so that analytics workloads could go to one set of machines and network ingress/egress applications to another pool. You don't want to add label awareness to these applications; queue-level configuration would seem more appropriate, as it puts the cluster admins in charge.

          Max added a comment -

          I would like to add a use case for this feature:
          We are developing a platform which can set up different tools for execution on YARN.
          In our industry there are tools which require specific hardware such as GPUs (from Nvidia) and/or FPGAs. These tools are not executed very often.
          Setting up this hardware (GPU/FPGA) on each node would be too expensive and unreasonable given the rare usage.
          The only practical solution would be to use tags.

          Arun C Murthy added a comment (edited) -

          Back to this, some thoughts:

          • Admin interface
            • Labels are specified by admins (node configuration, dynamic add/remove via rmadmin).
            • Each scheduler (CS, FS) can pick how they want labels specified in their configs
            • Dynamically added labels are, initially, not persisted across RM restarts. So, these need to be manually edited into yarn-site.xml, ACLs into capacity-scheduler.xml etc.
            • By default, all nodes have a default label, but admins can explicitly set a list of labels and drop the default label.
            • Queues have label ACLs, i.e. admins can specify, per queue, which labels can be used by applications in that queue
          • End-user interface
            • Applications can ask for containers on nodes with specific labels as part of the RR; however, host-specific RRs with labels are illegal i.e. labels are allowed only for rack & * RRs: results in InvalidResourceRequestException
            • RR with a non-existent label (point in time) is illegal: results in InvalidResourceRequestException
            • RR with label without appropriate ACL results in InvalidResourceRequestException (do we want a special InvalidResourceRequestACLException?)
            • Initially, RRs can ask for multiple labels with the expectation that it's an AND operation
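
          A small sketch of the AND semantics in the last bullet, treating a node's labels and a request's labels as plain string sets (the names and example labels are illustrative only): a request for {gpu, linux} would match a node labeled {gpu, linux, largeMem} but not one labeled only {linux}.

            import java.util.Set;

            /** Sketch of AND semantics: a node matches only if it carries every requested label. */
            final class LabelMatcher {
              static boolean matches(Set<String> nodeLabels, Set<String> requestedLabels) {
                // An empty request imposes no label constraint and matches any node.
                return nodeLabels.containsAll(requestedLabels);
              }
            }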
          Sandy Ryza added a comment -

          Makes a lot of sense to me. One nit:

          Each scheduler (CS, FS) can pick how they want labels specified in their configs

          Correct me if I'm misunderstanding what you mean here, but currently neither scheduler has node-specific stuff in its configuration. Updating the scheduler config when a node is added or removed from the cluster seems cumbersome. Should labels not be included in the NodeManager configuration like Resources are?

          Arun C Murthy added a comment -

          Sandy Ryza - sorry if that wasn't clear. I meant that the ACLs for labels should be specified in each scheduler's configuration.

          So, for example:

            <property>
              <name>yarn.scheduler.capacity.root.A.labels</name>
              <value>labelA, labelX</value>
            </property>
          
            <property>
              <name>yarn.scheduler.capacity.root.B.labels</name>
              <value>labelB, labelY</value>
            </property>
          

          Makes sense?
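
          For illustration, a scheduler could read such per-queue lists with the standard Configuration API; the property names follow the example above, and whether they live in capacity-scheduler.xml or elsewhere is still open at this point:

            import java.util.Collection;
            import org.apache.hadoop.conf.Configuration;

            /** Sketch: how a scheduler might read the per-queue label list shown above. */
            public class QueueLabelConfigExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.addResource("capacity-scheduler.xml");
                Collection<String> queueALabels =
                    conf.getTrimmedStringCollection("yarn.scheduler.capacity.root.A.labels");
                System.out.println("Labels allowed on root.A: " + queueALabels); // [labelA, labelX]
              }
            }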

          Alejandro Abdelnur added a comment -

          Arun, doing a recap on the config, is this what you mean?

          The ResourceManager yarn-site.xml would specify the valid labels system-wide (you didn't suggest this, but it prevents label typos from going unnoticed):

          <property>
            <name>yarn.resourcemanager.valid-labels</name>
            <value>labelA, labelB, labelX</value>
          </property>
          

          The NodeManagers' yarn-site.xml would specify the labels of the node:

          <property>
            <name>yarn.nodemanager.labels</name>
            <value>labelA, labelX</value>
          </property>
          

          The scheduler configuration, in its queue configuration, would specify what labels can be used when requesting allocations in that queue:

          <property>
            <name>yarn.scheduler.capacity.root.A.allowed-labels</name>
            <value>labelA</value>
          </property>
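
          Put together, the RM could validate a registering NM's labels against the system-wide list; a self-contained sketch of that check, assuming both lists arrive as plain string sets (the property names above would feed these sets, and the class/exception here are illustrative):

            import java.util.HashSet;
            import java.util.Set;

            /** Sketch of the typo check: labels reported by an NM must appear in the valid-labels list. */
            final class NodeLabelValidation {
              static void rejectUnknown(Set<String> validLabels, Set<String> nmLabels) {
                Set<String> unknown = new HashSet<>(nmLabels);
                unknown.removeAll(validLabels);   // whatever is left was never declared system-wide
                if (!unknown.isEmpty()) {
                  throw new IllegalArgumentException("Unknown node labels: " + unknown);
                }
              }
            }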
          
          Junping Du added a comment -

          The ResourceManager yarn-site.xml would specify the valid labels system-wide (you didn't suggest this, but it prevents label typos from going unnoticed):

          I don't think a label typo is a big issue. Restricting labels on the RM side could potentially prevent adding a new label for a new application on newly registering nodes, as we don't have a way to refresh the yarn-site config dynamically. Isn't that so?

          Alejandro Abdelnur added a comment -

          Scheduler configurations are refreshed dynamically; if the list of valid labels is there, it could be refreshed as well. I would prefer to detect & reject typos from a user experience and troubleshooting point of view.

          Arun C Murthy added a comment -

          Agree with Alejandro Abdelnur that it's better to reject RRs which ask for non-existent labels. However, to Junping Du's point, I think rather than rely on adding a label in 2 places (RM & individual NM), we are better off having the RM track existing labels dynamically (add/remove from a global list at RM as NMs register/deregister) and reject RRs as they come in.

          So, we get both the benefit of strict checking and the operational simplicity of having to specify things in only one place, i.e. the NM.

          Junping Du added a comment -

          Arun C Murthy, that sounds great. Through this, the RM is more stateless on labels, as the labels are rebuilt through NM registration when the RM restarts. The map (held in the RM) from label to NMs can be printed to help with troubleshooting. Reference counting on labels can guarantee that any valid label in an RR is backed by a specific number of NMs. In addition, for correcting or dynamically adding label info, we may provide an NM API to add/update label info that can push the NM's new label config to the RM.
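
          A sketch of the reference counting described here, assuming the RM sees plain label-string sets at NM register/deregister time (class and method names are illustrative, not from any patch):

            import java.util.Map;
            import java.util.Set;
            import java.util.concurrent.ConcurrentHashMap;

            /** Sketch: track how many live NMs carry each label; a label is valid while its count > 0. */
            final class ClusterLabelTracker {
              private final Map<String, Integer> labelToNodeCount = new ConcurrentHashMap<>();

              void nodeRegistered(Set<String> labels) {
                for (String label : labels) {
                  labelToNodeCount.merge(label, 1, Integer::sum);
                }
              }

              void nodeDeregistered(Set<String> labels) {
                for (String label : labels) {
                  // Drop the entry entirely once the last node carrying the label goes away.
                  labelToNodeCount.computeIfPresent(label, (k, v) -> v > 1 ? v - 1 : null);
                }
              }

              boolean isKnown(String label) {
                return labelToNodeCount.containsKey(label);
              }
            }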

          Arun C Murthy added a comment -

          I had the luxury of a long flight... cough

          Here is a very early WIP patch which illustrates an approach, there is a lot of W left in the WIP.

          Wangda Tan added a comment -

          Arun C Murthy, thanks for this patch. I have a few questions/suggestions for this implementation; please forgive my ignorance if they make no sense.
          1. Would it be good to put the SchedulerLabelsManager.add/removeNode invocations into the RMNodeImpl transitions, so they can be leveraged by CS/FS at the same time?
          2. I'm wondering, does it make sense to put SchedulerLabelsManager into RMContext?
          3. Do we need to consider that labels of a queue/application (not RR) will affect the headroom of an application? (If yes, we need to consider linking this to YARN-1198.)
          Thanks,

          Wangda Tan added a comment -

          Working on this JIRA, assigned it to myself. And will post a design doc in a day or two.

          Jian Fang added a comment -

          I'd like to add a use case to this JIRA.

          In a cloud environment, Hadoop could run on heterogeneous groups of instances. Take Amazon EMR as an example: usually an EMR Hadoop cluster runs with master, core, and task groups, where the task group could be spot instances that can go away at any time. As a result, we would like to have a tag capability on each node. That is to say, when a node manager starts up, it will load the tags from the configuration file. Then, the resource manager could refine the scheduling results based on the tags.

          One good example is that we don't want an application master to be assigned to any spot instance in a task group because that instance could be taken away by EC2 at any time.

          If Hadoop resources could support a tag capability, then we could extend the current scheduling algorithm to add constraints so as not to assign the application master to a task node.

          We don't really need any admin capability for the tags (but still good to have) since the tags are static and can be specified in a configuration file, for example yarn-site.xml.

          Wangda Tan added a comment -

          Jian Fang,
          Really appreciate your use case; we will consider it in the design.

          Bikas Saha added a comment -

          Thanks Jian Fang. An interesting use case from your comment is that not only can labels be used to specify affinity, but they can also be used to specify anti-affinity, i.e. don't place a task on a node with a certain label.
          Do I correctly understand this as being your use case?
          OR
          Is your ask that node managers should specify their own labels when they register with the RM, instead of the node-manager-to-label mapping being a central RM configuration?

          bc Wong added a comment -

          Having the NMs specify their own labels is probably better from an administrative point of view. It's harder for the labels to get out of sync. Each node can have a "discovery script" that updates its labels, which feeds into the NM. So an admin can take a bunch of nodes out for upgrade, and put them back in without having to carefully reconfigure any central mapping file.
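
          A tiny sketch of such a discovery step done in-process rather than via an external script: derive labels from what the JVM/OS report and print them as the single comma-separated line a label file would contain (the label names and the core-count threshold are made up for illustration):

            import java.util.LinkedHashSet;
            import java.util.Set;

            /** Sketch: derive node labels from locally discoverable facts. */
            public class LabelDiscovery {
              public static void main(String[] args) {
                Set<String> labels = new LinkedHashSet<>();
                labels.add("os-" + System.getProperty("os.name").toLowerCase().replace(' ', '-'));
                labels.add("arch-" + System.getProperty("os.arch"));
                if (Runtime.getRuntime().availableProcessors() >= 32) {
                  labels.add("manyCores");   // arbitrary illustrative threshold
                }
                System.out.println(String.join(",", labels));
              }
            }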

          Jian Fang added a comment -

          Hi Bikas, I think it is better to have the node manager specify its own labels and then register them with the RM.

          Also, it would be great if YARN could provide an API to add/update labels to a node. This is based on the following scenario.

          Usually a Hadoop cluster in the cloud is elastic; that is to say, the cluster size can be automatically or manually expanded or shrunk based on the cluster situation, for example, idleness. When a node in a cluster is chosen to be removed, we could call the API to label the node so that no more tasks would be assigned to it.

          We could use the decommission API to achieve this goal, but I think the label API may be more elegant.

          Lohit Vijayarenu added a comment -

          As Alejandro Abdelnur mentioned, labels sound closely related to affinity and should be treated less as a resource. They become closely related to resources when it comes to exposing them on scheduler queues and exposing that to users who wish to schedule their jobs on a certain set of labeled nodes. This is definitely a very useful feature to have. Looking forward to the design document.

          Vinod Kumar Vavilapalli added a comment -

          YARN-2253 was opened separately and is a dup. Wangda Tan, let's post the design doc so that we can consolidate the discussion? Thanks.

          Yuliya Feldman added a comment -

          Since YARN-2253 was closed as a duplicate, I'm adding the proposal here.
          It would be beneficial to provide label-based scheduling where applications can be scheduled on a subset of nodes based on logical label expressions that can be specified on queues and during application submissions.

          Vinod Kumar Vavilapalli added a comment -

          Thanks Yuliya Feldman! When Wangda Tan puts up our proposal, we can work towards merging both the docs after having a discussion. We can then figure out the sub-tasks and collaboration.

          Tx again!

          bc Wong added a comment -

          Yuliya Feldman & Swapnil Daingade, I just read your proposal (LabelBasedScheduling.pdf). I have a few comments:

          1. I would let each node report its own labels. The current proposal specifies the node-label mapping in a centralized file. This seems operationally unfriendly, as the file is hard to maintain.

          • You need to get the DNS name right, which could be hard for a multi-homed setup.
          • The proposal uses regexes on FQDNs, such as perfnode.*. This may work if the hostnames are set up by IT like that. But in reality, I've seen lots of sites where the FQDN is like stmp09wk0013.foobar.com, where "stmp" refers to the data center, "wk0013" refers to "worker 13", and other weird stuff like that. Now imagine a centralized node-label mapping file with 2000 nodes with such names. It'd be a nightmare.

          Instead, each node can supply its own labels, via yarn.nodemanager.node.labels (which specifies labels directly) or yarn.nodemanager.node.labelFile (which points to a file that has a single line containing all the labels). It's easy to generate the label file for each node. The admin can have puppet push it out, or populate it when the VM is built, or compute it in a local script by inspecting /proc. (Oh I have 192GB, so add the label "largeMem".) There is little room for mistake.

          The NM can still periodically refresh its own labels and update the RM via the heartbeat mechanism. The RM should also expose a "node label report", which is the real-time information of all nodes and their labels.

          2. Labels are per-container, not per-app. Right? The doc keeps mentioning "application label", "ApplicationLabelExpression", etc. Should those be "container label" instead? I just want to confirm that each container request can carry its own label expression. Example use case: Only the mappers need GPU, not the reducers.

          3. Can we fail container requests with no satisfying nodes? In "Considerations, #5", you wrote that the app would be in a waiting state. It seems that fail-fast behaviour would be better. If no node can satisfy the label expression, then it's better to tell the client "no". Very likely somebody made a typo somewhere.

          Allen Wittenauer added a comment -


          Instead, each node can supply its own labels, via yarn.nodemanager.node.labels (which specifies labels directly) or yarn.nodemanager.node.labelFile (which points to a file that has a single line containing all the labels). It's easy to generate the label file for each node.

          Why not just generate this on the node manager, a la the health check or topology scripts? Provide a hook to actually execute the script or the class and have the NM run it at a user-defined period, including "just at boot". [... and before it gets asked, yes, certain classes of hardware *do* allow such dynamic change.]

          Yuliya Feldman added a comment -

          bc Wong
          Thank you for your comments

          Regarding:
          The NM can still periodically refresh its own labels and update the RM via the heartbeat mechanism. The RM should also expose a "node label report", which is the real-time information of all nodes and their labels.
          Yes - you would have a yarn command, "showlabels", that would show all the labels in the cluster:
          "yarn rmadmin -showlabels"

          Regarding:
          2. Labels are per-container, not per-app. Right? The doc keeps mentioning "application label", "ApplicationLabelExpression", etc. Should those be "container label" instead? I just want to confirm that each container request can carry its own label expression. Example use case: Only the mappers need GPU, not the reducers.

          The proposal here is to have labels per application, not per container, though it is not that hard to specify a label per container (rather, per request).
          There are pros and cons for both (per container and per app):
          pros for App - the only place to "setLabel" is ApplicationSubmissionContext
          cons for App - as you said - you may want one configuration for mappers and another for reducers
          cons for container level labels - every application that wants to take advantage of the labels will have to code it in their AppMaster while creating ResourceRequests

          Regarding:
          — The proposal uses regexes on FQDN, such as perfnode.*.

          The file with labels does not need to contain regexes for FQDNs - since it will be based solely on the hostname that is used in the isBlackListed() method.
          But I am certainly open to suggestions on getting labels from the nodes, as long as it is not a high burden on the cluster admin who needs to specify labels per node on the node.

          Regarding:
          — Can we fail container requests with no satisfying nodes?

          I think it would be the same behavior as for any other request that cannot be satisfied because queues were set up incorrectly, or there is no free resource available at the moment. How would you differentiate between those cases?

          Wangda Tan added a comment -

          I've attached the design doc – "Node-labels-Requirements-Design-doc-V1.pdf". This is a doc we're working on; any feedback is welcome, and we can continuously improve the design doc.

          Thanks,
          Wangda Tan

          Wangda Tan added a comment -

          Hi Yuliya Feldman and Swapnil,
          Thanks for uploading the proposal; I just read it and have several comments.

          1. Label Expression

          Label expression - logical combination of labels (using && - and, || - or, ! - not)

          It seems to me the label expression is too complex here; the expression will be evaluated when we make a scheduling decision to allocate every container. We need to consider performance.
          Another problem with this is that it will make it harder to calculate the headroom of an application or the capacity of a queue.
          And it is not so straightforward for a user/admin to see how many nodes can satisfy a given label expression.
          IMHO, we can simply make node labels AND'ed; most scenarios will be covered. It will be easier to evaluate and easier for users to understand as well.

          2. Queue Policy
          There're 4 policies mentioned in your proposal. We should reduce the complexity of configuration as much as possible.
          At least, "OR" is not so meaningful to me here; do you have any use case/example for this one?
          I think AND should be enough to cover most use cases.

          3. Labels Manager
          3.1 What's the process for modifying the node label configuration? Since the file is stored on DFS, does the admin modify the configuration in a local file, then upload it to DFS via "hadoop fs -copyFromLocal ..."? If yes, it will be hard for admins to configure.
          3.2

          We suggest centralized location for node labels such as file stored on DFS that all the YARN daemons

          What's the point of making it available to all YARN daemons? I think making it available to the RM should be enough here.

          4. Specify labels in container level
          I found you plan to add a labels field in ResourceRequest, which was also mentioned by bc Wong. I think we should support the container level; users don't have to use it, and it will only be needed when specifying labels at the app level is not enough.
          And if we support this, it will not be sufficient to change isBlackListed in AppSchedulingInfo alone to make the fair/capacity schedulers work. We may need to modify the implementations of the different schedulers.

          5. Label specification for hierarchy queues
          We could support specifying labels in leaf queues only; in the existing scheduler configuration, settings like user-limit, etc. can only be specified on leaf queues, so we can make them consistent. The "closest will be used" strategy will potentially cause some configuration issues as well.

          6. In "Considerations" part
          6.1

          If we assume that during life of the application none of those changes can take effect on the application

          I think we can assume an application will not change its label expression during its lifecycle. But updating the labels of a node/queue should affect future scheduling decisions.
          And even if we assume queue/node labels do not change for an application, we still need to consider nodes being added/removed dynamically in the cluster.

          6.2

          When invalid label expression (consists of label(s) that are not present in the labels file) is used to define for Queue or Application it will be ignored as if no label was set. RM logs will have errors about usage of invalid labels

          I think we should tell the user this resource request is invalid; we cannot hide this error in the RM logs, because not every user can access the logs of the YARN daemons.

          6.3

          If no node that satisfies final label evaluation is available Application will be waiting to be submitted.

          In our proposal, the AMS will reject the request if no node satisfies the node label of a ResourceRequest, because the user may have mis-typed the node label in the ResourceRequest.
          We may need to discuss which one is better.

          Thanks,
          Wangda Tan

          Sandy Ryza added a comment -

          +1 on reducing the complexity of the label predicates. We should only use OR if we can think of a few concrete use cases where we would need it.

          Yuliya Feldman added a comment -

          Wangda Tan Thank you for your comments.
          BTW - Your document is a great set of requirements, it was a real pleasure reading it.

          Please see my answers.

          1. Label Expression
          >>>>> Label expression - logical combination of labels (using && - and, || - or, ! - not)
          It seems to me the label expression is too complex here; the expression will be evaluated when we make a scheduling decision to allocate every container. We need to consider performance.

          I definitely agree that performance needs to be considered here.
          An application-level label expression is not going to change over the application's lifetime - so this can be cached.
          A queue-level label expression is going to change only when the queue label is changed - so this can be cached per queue.
          So the final expression that combines the queue label, application label expression and QueueLabelPolicy does not need to be evaluated for every ResourceRequest - again, unless the AppMaster dynamically assigns different labels per container request.

          What probably needs to be re-evaluated is which nodes satisfy the final/effective LabelExpression, as nodes can come and go and labels on them can change.
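
          A sketch of the caching being described: the combined (queue, application) expression is computed once per distinct key and reused, while node matching stays per-heartbeat (the types and the policy handling here are illustrative placeholders, not the proposal's classes):

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            /** Sketch: cache the effective label expression per (queue label, app expression, policy). */
            final class EffectiveExpressionCache {
              private final Map<String, String> cache = new ConcurrentHashMap<>();

              String effectiveExpression(String queueLabel, String appExpression, String policy) {
                String key = queueLabel + "|" + appExpression + "|" + policy;
                // Combining is done at most once per distinct key.
                return cache.computeIfAbsent(key, k -> combine(queueLabel, appExpression, policy));
              }

              private static String combine(String queueLabel, String appExpression, String policy) {
                // Follows the AND/OR queue policies discussed above.
                return "OR".equals(policy)
                    ? "(" + queueLabel + ") || (" + appExpression + ")"
                    : "(" + queueLabel + ") && (" + appExpression + ")";
              }
            }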

          >>>> Another problem with this is that it will make it harder to calculate the headroom of an application or the capacity of a queue.
          Thank you for pointing this out. I will double-check this.

          >>>> And it is not so straightforward for a user/admin to see how many nodes can satisfy a given label expression.
          I am sure we can provide an admin API/REST/UI to enter an expression and get the result.

          >>>> IMHO, we can simply make node labels AND'ed; most scenarios will be covered. It will be easier to evaluate and easier for users to understand as well.
          Let me understand it better: if an application provides multiple labels they are "AND"ed, and so only nodes that have the same set of labels or a superset will be used?

          2. Queue Policy
          >>>> There're 4 policies mentioned in your proposal. We should reduce the complexity of configuration as much as possible.
          >>>> At least, "OR" is no so meaningful to me here, do you have any usecase/example on this one?
Consider this as a union of the LabelExpressions from the Application and the Queue. So if the application LabelExpression is "blue" and the QueueExpression is "yellow",
you can allocate containers on nodes that carry either label "blue" or "yellow" (nodes that are not marked with either won't be used). This is unlike the "AND" case, where you can only run on nodes that are marked with both "blue" and "yellow" (a smaller subset of nodes).
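
A tiny sketch of the difference (method names are illustrative only, not from the attached patches): the OR policy takes the union of the queue and application labels and accepts any node carrying at least one of them, while the AND policy requires the node to carry all of them.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical illustration of the two queue policies discussed above.
    public class LabelPolicyExample {
        // OR policy: node matches if it carries at least one label from the
        // union of the queue's and the application's labels.
        static boolean matchesOr(Set<String> nodeLabels, Set<String> queueLabels, Set<String> appLabels) {
            Set<String> union = new HashSet<>(queueLabels);
            union.addAll(appLabels);
            for (String label : union) {
                if (nodeLabels.contains(label)) {
                    return true;
                }
            }
            return false;
        }

        // AND policy: node matches only if it carries every label from both sets.
        static boolean matchesAnd(Set<String> nodeLabels, Set<String> queueLabels, Set<String> appLabels) {
            Set<String> required = new HashSet<>(queueLabels);
            required.addAll(appLabels);
            return nodeLabels.containsAll(required);
        }

        public static void main(String[] args) {
            Set<String> node = new HashSet<>(Arrays.asList("yellow", "large_memory"));
            Set<String> queue = new HashSet<>(Arrays.asList("yellow"));
            Set<String> app = new HashSet<>(Arrays.asList("blue"));
            System.out.println(matchesOr(node, queue, app));   // true: node carries "yellow"
            System.out.println(matchesAnd(node, queue, app));  // false: node lacks "blue"
        }
    }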

          I think AND should be enough to cover most usecases.
          3. Labels Manager
          >>>> 3.1 What's process of modifying the node label configuration? Since the file is stored on DFS, does admin modify the configuration on a local file, then upload it to DFS via "hadoop fs -copyFromLocal ..."? If yes, it will be hard for admin to configure.
          Yes - so far this is a procedure. Not sure what is "hard" here, but we can have some API to do it.

          3.2
          We suggest centralized location for node labels such as file stored on DFS that all the YARN daemons
          >>>> What's prospect to make it available to all YARN daemons? I think make it available to RM should be enough here.
Agree that today this file may only be relevant to the RM. If it is stored as a local file or by other means, there is a greater chance for it to be overwritten or lost during an upgrade.

          4. Specify labels in container level
          I found you plan to add a labels field in ResourceRequest, and also mentioned by Bc Wong. I think we should support container level, user doesn't have to do it, it will be only used when specify labels at app-level is not enough.
>>>> Yes - if the application level is not enough, the user can specify it at the request level; otherwise it is not necessary. Though I cannot say we have looked very closely at the possibility of setting labels at a more granular level (to address your next comment)

          And if we support this, it will be not sufficient to change isBlackListed at AppSchedulingInfo only in scheduler to make fair/capacity scheduler works. We may need to modify implementations of different schedulers.
          5. Label specification for hierarchy queues
          We can only support specify labels in leaf queues, in existing scheduler configuration, like user-limit, etc. can be only specified on leaf queue, we can make them consistent. "The closest will be used." strategy will potentially cause some configuration issues as well.
>>>> Sure, we can make them consistent; our thought process was that if you have multiple leaf queues that should share the same label/policy, you can specify it at the parent level, so you don't need to "type" more than necessary
          6. In "Considerations" part
          6.1
          If we assume that during life of the application none of those changes can take effect on the application
          >>>> I think we can assume application will not change label expression during its lifecycle. But updating labels of node/queue should affect future scheduling considerations.
Yes, the application label expression is definitely not going to change over the application's run time; I was referring to re-evaluation of the "final" label expression against matching nodes, for performance reasons.

          And even if we assume queue/node labels not changed to an application, we still need to consider node add/remove dynamically in the cluster
          6.2
          >>>> When invalid label expression (consists of label(s) that are not present in the labels file) is used to define for Queue or Application it will be ignored as if no label was set. RM logs will have errors about usage of invalid labels
          >>>> I think we should tell user this resource request is invalid, we cannot hide this error in RM logs. Because not every user can access logs of YARN daemons.
          Completely agree that this needs to be propagated to the end user in some shape or form. Would love to hear your proposal in this area

          6.3
          If no node that satisfies final label evaluation is available Application will be waiting to be submitted.
          >>>> In our proposal, AMS will reject if no node satisfies node label of a ResourceRequest. Because user may mis-filling node label in ResourceRequest.
          >>>> We may need discuss which one will be better.
          Absolutely - let's discuss it.

          amit hadke added a comment -

AND is a rare case and will probably never be used. OR is a good default choice.
I would strongly advise supporting NOT.
Example:
Run reduce tasks on any machine that is not labelled 'production'.
This gives include/exclude functionality.

          Wangda Tan added a comment -

          Reply:
          Hi Yuliya,
Thanks for your reply. It's great to read your doc and discuss with you too.
          Please see my reply below.

          1)

What probably needs to be evaluated is which nodes satisfy the final/effective LabelExpression, as nodes can come and go and the labels on them can change.

Agreed, what I meant is that we need to consider the performance of 2 things:

• The time to evaluate a label expression - IMO we need to add labels at the per-container level, so expressions get evaluated for every container allocation.
• Whether it is important to get the headroom or how many nodes can be used for an expression - the simpler the expression, the easier it is for us to get the results mentioned previously.

          2)

Let me understand it better: if an application provides multiple labels, they are "AND"ed, so only nodes that have the same set of labels or a superset will be used?

Yes.
The reason I think this is important is that a label is treated as a tangible resource here. Imagine you are running an HBase master: you may want the node to be "stable", "large_memory", "for_long_running_service". Or if you are running a scientific computing program, you want a node that has "GPU", "large_memory", "strong_cpu". It does not make sense to use "OR" in these cases.

          To Sandy/Amit, do you have any specific use case for OR?
My basic feeling about supporting different operators like "OR"/"NOT" here is that we may support them if they have clear, highly demanded use cases. But we had better not use combined expressions; if we do, we need to add parentheses, which will increase the complexity of evaluating them.
          Let's hear more thoughts from community about this.

          3)

          Yes - so far this is a procedure. Not sure what is "hard" here, but we can have some API to do it.

Do you have any ideas about what the API would look like?

          4)

Agree that today this file may only be relevant to the RM. If it is stored as a local file or by other means, there is a greater chance for it to be overwritten or lost during an upgrade.

          Agree

          5)

          And if we support this, it will be not sufficient to change isBlackListed at AppSchedulingInfo only in scheduler to make fair/capacity scheduler works. We may need to modify implementations of different schedulers.

          Agree

          6)

Sure, we can make them consistent; our thought process was that if you have multiple leaf queues that should share the same label/policy, you can specify it at the parent level, so you don't need to "type" more than necessary

I think for different schedulers, we should specify queue-related parameters in their respective configurations. Let's get more ideas from the community about how to specify queue parameters before moving ahead.

          Thanks,
          Wangda

          Yuliya Feldman added a comment -

          1)

Agreed, what I meant is that we need to consider the performance of 2 things:

• The time to evaluate a label expression - IMO we need to add labels at the per-container level, so expressions get evaluated for every container allocation.
• Whether it is important to get the headroom or how many nodes can be used for an expression - the simpler the expression, the easier it is for us to get the results mentioned previously.

Regarding the time to evaluate a label expression - we need to get some performance stats on how many ops we can process. I will try to get those performance numbers based on different levels of expression complexity.
We did not do anything to include label evaluation in the headroom calculation, so I don't have comments there.

          2)

Do you have any ideas about what the API would look like?

It can be as simple as "yarn rmadmin -loadlabels <local_file_path> <remote_file_path>".
I am not sure if you meant anything else.

          3)

I think for different schedulers, we should specify queue-related parameters in their respective configurations. Let's get more ideas from the community about how to specify queue parameters before moving ahead.

          I have some examples in the document for Fair and Capacity Schedulers

          Wangda Tan added a comment -

          I have some examples in the document for Fair and Capacity Schedulers

Thanks for pointing me to this; I will take a look at it.

          Jian Fang added a comment -

          In our environment, most likely the label condition will be OR, not AND. But it is good to support basic logic such as AND, OR, and NOT.

Users may want to allocate application masters only to nodes with specific labels. This is a special use case because the AM container is actually launched by Hadoop itself, not the user. You may want to add a parameter such as "yarn.app.mapreduce.am.labels" so that Hadoop will honor it. You may also want to add an option like "yarn.label.enabled" to turn the label feature on and off.

Why do users have to choose either decentralized or centralized label configuration? Labels could be both static and dynamic. The static ones would be loaded from yarn-site.xml on each node, and the dynamic ones would be specified via a RESTful API or by an admin. To me, the RESTful API could be more useful than the Admin UI. For example, everything is automated for clusters in a cloud, with no manual work in most cases. As a result, I would rather have a RESTful API to update the labels on a node directly through the node manager, which would in turn sync with the resource manager. Or the API could update both the resource manager and the node manager if the sync time is a problem here.

          Sunil G added a comment -

          Hi Wangda Tan (No longer used)
          Great. This feature will be a big addition to YARN.

I have a few thoughts on this.

1. In our use case scenarios, we are more likely to have OR and NOT. I feel combinations of these labels need to be allowed only in a defined or restricted way. The result of some combinations (AND, OR and NOT) may be invalid, and some may need to be reduced. This complexity should not have to be brought to the RM to make the final decision.
2. Reservation: If a node label has many nodes under it, then there is a chance of reservation. Valid candidates may come later, so the solution can look into this aspect also. Node-label-level reservations?
3. Centralized Configuration: If a new node is added to the cluster, maybe it can be started with a label configuration in its yarn-site.xml. This may be fine, I feel. Your thoughts?

          Wangda Tan added a comment -

          Hi Jian Fang,
          Thanks for providing use cases.

          Why do users have to choose either decentralized or centralized label configuration?

This is because of cases like a user wanting to remove some static labels via the dynamic API; on the next RM restart, the static labels would be loaded again. It will be hard to manage static/dynamic together; we would need to handle conflicts, etc.

To me, the RESTful API could be more useful than the Admin UI.

I think both of them are very important in normal cases. The RESTful API can be used by other management frameworks, while the Admin UI can be used directly by admins to tag nodes.

          Wangda Tan added a comment -

          Hi Sunil G,
          Thanks for reply,

1. In our use case scenarios, we are more likely to have OR and NOT. I feel combinations of these labels need to be allowed only in a defined or restricted way. The result of some combinations (AND, OR and NOT) may be invalid, and some may need to be reduced. This complexity should not have to be brought to the RM to make the final decision.

Agreed that we need some restricted way; we need to think harder about this.

2. Reservation: If a node label has many nodes under it, then there is a chance of reservation. Valid candidates may come later, so the solution can look into this aspect also. Node-label-level reservations?

I haven't thought about this before; I'll think about it. Thanks for reminding me.

3. Centralized Configuration: If a new node is added to the cluster, maybe it can be started with a label configuration in its yarn-site.xml. This may be fine, I feel. Your thoughts?

I think this is more like a decentralized configuration in your description. For centralized configuration, I think there could be a "node label repo" which stores the mapping of nodes to labels, and we will provide a RESTful API for changing them.

          Thanks,
          Wangda

          Sunil G added a comment -

2. Regarding reservations, how about introducing node-label reservations? The idea is: if an application is lacking resources on a node, it can reserve on that node as well as on the node-label. So when a suitable node update comes from another node under the same node-label, it can try allocating the container on the new node by unreserving from the old one.

3. My approach was more like having a centralized configuration, but later, if we want to add a new node to the cluster, it can start with a hardcoded label in its yarn-site. In your approach, we need to use the RESTful API or an admin command to bring this node under a label. Maybe the node can be put under a label at startup itself. Your thoughts?

          Sandy Ryza added a comment -

          I'm worried that the proposal is becoming too complex. Can we try to whittle the proposal down to a minimum viable feature? I'm not necessarily opposed to the more advanced parts of it like queue label policies and updating labels on the fly, and the design should aim to make them possible in the future, but I don't think they need to be part of the initial implementation.

          To me it seems like the essential requirements here are:

          • A way for nodes to be tagged with labels
          • A way to make scheduling requests based on these labels

          I'm also skeptical about the need for adding/removing labels dynamically. Do we have concrete use cases for this?

          Lastly, as BC and Sunil have pointed out, specifying the labels in the NodeManager confs greatly simplifies configuration when nodes are being added. Are there advantages to a centralized configuration?

          Jian Fang added a comment -

As Sandy pointed out, it seems the scope is becoming bigger and bigger. Take our use case as an example: we initially only needed to restrict application masters from being assigned to certain nodes, such as spot instances in EC2. In our design, we only added the following parameters

          yarn.label.enabled
          yarn.nodemanager.labels
          yarn.app.mapreduce.am.labels

to yarn-site.xml and then modified the Hadoop code. This functionality works now. With the current proposal, I wonder how long it may take to finish.
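
For reference, a minimal sketch of how such settings could be read with Hadoop's Configuration API (the yarn.label.* keys are the in-house parameters listed above, not properties that exist in upstream YARN):

    import org.apache.hadoop.conf.Configuration;

    // Reads the hypothetical label-related settings from yarn-site.xml.
    public class LabelConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("yarn-site.xml");

            boolean labelsEnabled = conf.getBoolean("yarn.label.enabled", false);
            String[] nodeLabels = conf.getStrings("yarn.nodemanager.labels", new String[0]);
            String amLabelExpr = conf.get("yarn.app.mapreduce.am.labels", "");

            if (labelsEnabled) {
                System.out.println("NM labels: " + String.join(",", nodeLabels));
                System.out.println("AM label expression: " + amLabelExpr);
            }
        }
    }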

I also doubt the assumption that an admin will configure labels for a cluster by hand. A cluster usually comes with hundreds or thousands of nodes; how feasible is it for the admin to manually configure the labels? This type of work can easily be automated by a script or a Java process running on each node that writes labels such as OS, processor, and other parameters to yarn-site.xml before the cluster is started. This is especially true for clusters in a cloud, because everything is automated there. The admin UI would only be used in some special cases that require human intervention.

One use case for dynamic labeling is that we can put a label on a node when we try to shrink a cluster, so that Hadoop will not assign tasks to that node any more, giving it some grace time to be decommissioned. This is most likely to be implemented by a RESTful API call from a process that chooses a node to remove based on the cluster's metrics.

          Allen Wittenauer added a comment -

          I agree pretty much completely with everything Sandy said, especially on the centralized configuration. It actually makes configuration harder for heterogeneous node setups.

          One caveat:

          I'm also skeptical about the need for adding/removing labels dynamically. Do we have concrete use cases for this?
          

          If you have the nodemanager push the labels to the RM (esp if you can do this via user defined script or java class...), you basically have to have dynamic labels for nodes. Use cases are pretty easy to hit if you label nodes based upon the software stack installed. A quick example for those not following:

          1. User writes software that depends upon a particular version of libfoo.so.2.
          2. Configuration management does an install of libfoo.so.2
          3. NodeManager label script picks up that it has both libfoo.so.1 and libfoo.so.2. Publishes that it now has "libfoo1" and "libfoo2". (Remember, this is C and not the screwed up Java universe so having two versions is completely legitimate)
          4. system can now do operations appropriate for either libfoo on that node.
          5. libfoo1 gets deprecated and removed from the system, again via configuration management.
          6. label script picks up change and removes libfoo1 from label listing
          7. system acts appropriately and no longer does operations on node based upon libfoo1 label

          ... and all without restarting or reconfiguring anything on the Hadoop side. If there is any sort of manual step required in configuration the nodes short of the initial label script/class and other obviously user-provided bits, then we've failed.
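
A rough sketch of what such an NM-side label script hook could look like (the script path and the comma-separated output format are assumptions for illustration, not an existing YARN interface):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;

    // Runs an admin-provided script and parses its output into a label set.
    // The NM could re-run this periodically and report changes on heartbeats.
    public class NodeLabelScriptRunner {
        static Set<String> collectLabels(String scriptPath) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
            Set<String> labels = new HashSet<>();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    for (String label : line.split(",")) {       // e.g. "libfoo1,libfoo2"
                        if (!label.trim().isEmpty()) {
                            labels.add(label.trim());
                        }
                    }
                }
            }
            p.waitFor();
            return labels;
        }
    }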

          Alejandro Abdelnur added a comment -

I agree with Sandy and Allen.

That said, we currently don't do anything centralized on a per-NodeManager basis. If we want to do that, we should think about solving it in a more general way than just labels, and I would suggest doing that (if we decide to) in a different JIRA.

          Wangda Tan added a comment -

Really, thanks for all your comments above.

As Sandy, Alejandro and Allen mentioned, there are concerns about centralized configuration. My thinking is that node labels are more dynamic compared to any other existing NM options.
An important use case we can see is that some customers want to put a label on each node to indicate which department/team the node belongs to; when a new team comes in and new machines are added, labels may need to be changed. It is also possible that the whole cluster is booked to run some huge batch job at 12am-2am, for example. So such labels will be changed frequently. If we only have distributed configuration on each node, it is a nightmare for admins to re-configure.
I think we should have the same internal interface for distributed/centralized configuration, like what we've done for RMStateStore.

          And as Jian Fang mentioned,

doubt the assumption that an admin will configure labels for a cluster by hand.

I think using a script to set labels is a great way to save configuration work, but lots of other use cases need human intervention as well. There are good examples from Allen and me above.

          Thanks,
          Wangda

          Alejandro Abdelnur added a comment -

Wangda, your use case is throwing overboard the work of the scheduler regarding matching nodes for data locality. You can solve it in a much better way using the scheduler queue configuration, which can be dynamically adjusted.

          Wangda Tan added a comment -

          Hi Alejandro,
I totally understand that the use case I mentioned is antithetical to the design philosophy of YARN, which is to elastically share resources in a multi-tenant environment. But hard partitioning has some important use cases, even if it is not strongly recommended.
Like in some performance-sensitive environments. For example, a user may want to run the HBase master/region-servers on a group of nodes and not want any other tasks running on these nodes even if they have free resources.
Our current queue configuration cannot solve such a problem. Of course the user can create a separate YARN cluster in this case, but I think keeping such NMs under the same RM is easier to use and manage.

          Do you agree?
          Thanks,

          Alejandro Abdelnur added a comment -

Wangda, I'm afraid I'm lost with your last comment. I thought labels were to express desired node affinity based on a label, not to fence off nodes. I don't understand how you will achieve fencing off a node with a label unless you have a more complex annotation mechanism than just a label (i.e. book this node only if label X is present). Also, you would have to add ACLs to labels to prevent anybody from simply asking for a label.

Am I missing something?

          Wangda Tan added a comment -

          Alejandro,
I think we've mentioned this in our design doc; you can check https://issues.apache.org/jira/secure/attachment/12654446/Node-labels-Requirements-Design-doc-V1.pdf, "top level requirements" > "admin tools" > "Security and access controls for managing Labels". Please let me know if you have any comments on it.

Thanks,

          Wangda Tan added a comment -

          Hi Sunil G,

2. Regarding reservations, how about introducing node-label reservations? The idea is: if an application is lacking resources on a node, it can reserve on that node as well as on the node-label. So when a suitable node update comes from another node under the same node-label, it can try allocating the container on the new node by unreserving from the old one.

I think this makes sense; we'd better support it. I will check how our current resource reservation/unreservation logic can support it, and will keep you posted.

3. My approach was more like having a centralized configuration, but later, if we want to add a new node to the cluster, it can start with a hardcoded label in its yarn-site. In your approach, we need to use the RESTful API or an admin command to bring this node under a label. Maybe the node can be put under a label at startup itself. Your thoughts?

I think one problem of mixed centralized/distributed configuration I can see is that it will be hard to manage them after an RM/NM restart – should we use the labels specified in the NM config or in our centralized config? I also replied to Jian Fang previously about this: https://issues.apache.org/jira/browse/YARN-796?focusedCommentId=14063316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14063316.
Maybe a workaround is that we can define that the centralized config always overwrites the distributed config. E.g. the user defined "GPU" in the NM config and the admin added "FPGA" via the RESTful API; the RM will serialize both "GPU" and "FPGA" into a centralized storage system. After an RM or NM restart, the RM will ignore the NM config if anything is defined in the RM. But I still think it's better to avoid using both of them together.
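
A minimal sketch of that precedence rule (names are illustrative only): the RM persists the merged label set centrally, and after a restart the centrally stored labels win over whatever the NM reports from its local config.

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical resolution of a node's labels after an RM/NM restart.
    public class NodeLabelResolver {
        // centralLabels: labels persisted by the RM for this node (possibly empty).
        // nmReportedLabels: labels the NM read from its own yarn-site.xml.
        static Set<String> resolve(Set<String> centralLabels, Set<String> nmReportedLabels) {
            if (centralLabels != null && !centralLabels.isEmpty()) {
                return Collections.unmodifiableSet(centralLabels);                    // central config wins
            }
            return Collections.unmodifiableSet(new HashSet<>(nmReportedLabels));      // fall back to NM config
        }
    }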

          Allen Wittenauer added a comment -

An important use case we can see is that some customers want to put a label on each node to indicate which department/team the node belongs to; when a new team comes in and new machines are added, labels may need to be changed.

You can solve this problem today by just running separate RMs. In practice, however, marking nodes for specific teams in queue systems doesn't work because doing so assumes that the capacity never changes... i.e., nodes never fail. That happens all the time, of course, thus why percentages make a lot more sense. If you absolutely want a fixed amount of capacity, you still wouldn't mark specific nodes: you'd say "queue x gets y machines" with no specification of which nodes.

It is also possible that the whole cluster is booked to run some huge batch job at 12am-2am, for example. So such labels will be changed frequently.

          Well, no, they won't. They'll happen exactly twice a day. But it doesn't matter: you can solve this problem today too by just setting something that changes the queue acls at 12am and 2am via a cron job.

For example, a user may want to run the HBase master/region-servers on a group of nodes and not want any other tasks running on these nodes even if they have free resources. Our current queue configuration cannot solve such a problem

          ... except, you guessed it: this is a solved problem today too. You just need to make sure the container sizes that are requested consume the whole node.

          If we only have distributed configuration on each node, it is a nightmare for admins to re-configure.

          Hi. My name is Allen and I'm an admin. Even if using labels for doing this type of scheduling was sane, it still wouldn't be a nightmare because any competent admin would use configuration management to roll out changes to the nodes in a controlled manner.

          But more importantly: these use cases are solved problems and have been in YARN for a very long time.

          Wangda Tan added a comment -

          You can solve this problem today by just running separate RMs.

I don't think that's good for configuration; users would need to maintain several configuration directories on their nodes for job submission.

          In practice, however, marking nodes for specific teams in queue systems doesn't work because doing so assumes that the capacity never changes... i.e

It is possible that you cannot replace a failed node with a random node in a heterogeneous cluster. E.g. only some nodes have GPUs, and these nodes are dedicated to the data science team. A percentage of queue capacity doesn't make a lot of sense here.

          ... except, you guessed it: this is a solved problem today too. You just need to make sure the container sizes that are requested consume the whole node.

Assume an HBase master wants to run on a node that has 64G of memory and InfiniBand. You can ask for a 64G container, but it is likely to be allocated to a 128G node that doesn't have InfiniBand.
Again, it's another heterogeneity issue.
And asking for such a big container may take a great amount of time, waiting for resource reservation, etc.

          it still wouldn't be a nightmare because any competent admin would use configuration management to roll out changes to the nodes in a controlled manner.

It is very likely that not every admin has scripts like yours, especially newer YARN users; we'd better make this feature usable out of the box.

          Allen Wittenauer added a comment -

          Then let me be more blunt about it:

          I'm -1 this patch if I can't do dynamic labels from the node manager via a script.

          Wangda Tan added a comment -

          Allen,
I think what we were just talking about is how to support the hard-partition use case in YARN, weren't we? I'm surprised to get a "-1" here; nobody has ever said that dynamic labeling from the NM will not be supported.

          Alejandro Abdelnur added a comment -

Wangda, I previously missed the new doc explaining label predicates. Thanks for pointing it out.

          How about first shooting for the following?

          • RM has list of valid labels. (hot reloadable)
          • NMs have list of labels. (hot reloadable)
          • NMs report labels at registration time and on heartbeats when they change
          • label-expressions support && (AND) only
          • app able to specify a label-expression when making a resource request
          • queues to AND augment the label expression with the queue label-expression

          And later we can add (in a backwards compatible way)

          • add support for OR and NOT to label-expressions
          • add label ACLs
          • centralized per NM configuration, REST API for it, etc, etc

          Thoughts?

          Jian Fang added a comment -
          • RM has list of valid labels. (hot reloadable)

          This requires the RM to have a global picture of the cluster before it starts, which is unlikely to be true in our use case, where we provide Hadoop as a cloud platform and the RM does not have any information about the slave nodes until they join the cluster. Why not just treat all labels registered by NMs as valid ones? Label validation could apply only to resource requests.

          • label-expressions support && (AND) only

          At least in our use case, OR is often used, not AND

          Wangda Tan added a comment -

          Hi Tucu,
          Thanks for providing thoughts about how to stage the development work. It's reasonable, and we're trying to scope the work for a first pass as well.
          Will keep you posted.

          Thanks,
          Wangda

          Wangda Tan added a comment -

          Jian Fang,
          I think it makes sense for the RM to have a global picture, because that way we can prevent typos created by admins manually filling in labels in NM configs, etc.
          On the other hand, I think your use case is also reasonable.
          We should support both of them, as well as "OR" label expressions. Will keep you posted when we have made a plan.

          Thanks,
          Wangda

          Yuliya Feldman added a comment -

          To everybody who was so involved in providing input over the last couple of days:
          I can provide support for App, Queue, and Queue Label Policy expressions.
          I also did some performance measurements: with 1000 entries of nodes and their labels, it takes roughly an additional 700 ms to process 1 million requests (hot cache). If we need to re-evaluate on every ResourceRequest within an App, performance will go down.
          This should cover

          label-expressions support && (AND) only
          app able to specify a label-expression when making a resource request - kind of (done per application at the moment, not per resource request)
          queues to AND augment the label expression with the queue label-expression
          add support for OR and NOT to label-expressions

          As for:

          RM has list of valid labels. (hot reloadable)
          NMs have list of labels. (hot reloadable)

          With a file in DFS you can get hot-reloadable valid labels on the RM (unless somebody makes a typo).

          Tan, Wangda - How do you want to proceed here?

          Yuliya Feldman added a comment -

          Since I did not hear anything for a week regarding a joint effort, here is a first version of the patch, based on the "LabelBasedScheduling" design document.

          Yuliya Feldman added a comment -

          First patch based on "LabelBasedScheduling" design document

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4467//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4468//console

          This message is automatically generated.

          Gera Shegalov added a comment -

          Hi Yuliya Feldman, thanks for posting the patch. Please rebase it since it no longer applies.

          Yuliya Feldman added a comment -

          Yes, noticed - will repost in a moment

          Yuliya Feldman added a comment -

          Patch to comply with svn

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658371/YARN-796.patch.2
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4469//console

          This message is automatically generated.

          Yuliya Feldman added a comment -

          Rebased from trunk

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658377/YARN-796.patch.3
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 2 warning messages.
          See https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.client.TestRMAdminCLI

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4470//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4470//console

          This message is automatically generated.

          Yuliya Feldman added a comment -

          Fixing failed Test, FindBugs and JavaDocs

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12658538/YARN-796.patch4
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.TestMRWithDistributedCache
          org.apache.hadoop.mapred.TestJobClientGetJob
          org.apache.hadoop.mapred.TestLocalModeWithNewApis
          org.apache.hadoop.mapreduce.TestMapReduce
          org.apache.hadoop.mapreduce.lib.input.TestLineRecordReaderJobs
          org.apache.hadoop.mapred.jobcontrol.TestLocalJobControl
          org.apache.hadoop.mapred.TestJobCounters
          org.apache.hadoop.mapred.TestLocalMRNotification
          org.apache.hadoop.mapred.lib.TestDelegatingInputFormat
          org.apache.hadoop.mapred.TestReduceFetch
          org.apache.hadoop.mapreduce.TestMapReduceLazyOutput
          org.apache.hadoop.mapreduce.lib.join.TestJoinProperties
          org.apache.hadoop.mapred.lib.TestMultithreadedMapRunner
          org.apache.hadoop.mapred.TestClusterMRNotification
          org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner
          org.apache.hadoop.mapreduce.lib.chain.TestSingleElementChain
          org.apache.hadoop.mapreduce.TestMapperReducerCleanup
          org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
          org.apache.hadoop.mapreduce.v2.TestMRJobsWithProfiler
          org.apache.hadoop.fs.TestFileSystem
          org.apache.hadoop.mapreduce.TestLargeSort
          org.apache.hadoop.mapred.join.TestDatamerge
          org.apache.hadoop.mapreduce.lib.input.TestMultipleInputs
          org.apache.hadoop.mapred.TestLazyOutput
          org.apache.hadoop.mapred.TestTaskCommit
          org.apache.hadoop.mapreduce.TestMRJobClient
          org.apache.hadoop.mapreduce.security.TestMRCredentials
          org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers
          org.apache.hadoop.mapred.lib.TestChainMapReduce
          org.apache.hadoop.mapreduce.lib.fieldsel.TestMRFieldSelection
          org.apache.hadoop.mapreduce.lib.partition.TestMRKeyFieldBasedComparator
          org.apache.hadoop.mapreduce.lib.db.TestDataDrivenDBInputFormat
          org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath
          org.apache.hadoop.mapreduce.v2.TestMRJobs
          org.apache.hadoop.mapred.TestMapRed
          org.apache.hadoop.mapred.lib.TestKeyFieldBasedComparator
          org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat
          org.apache.hadoop.mapreduce.v2.TestNonExistentJob
          org.apache.hadoop.mapreduce.lib.input.TestDelegatingInputFormat
          org.apache.hadoop.mapred.TestMiniMRChildTask
          org.apache.hadoop.fs.slive.TestSlive
          org.apache.hadoop.mapred.TestComparators
          org.apache.hadoop.mapreduce.v2.TestUberAM
          org.apache.hadoop.mapred.TestMiniMRClasspath
          org.apache.hadoop.mapred.TestMapOutputType
          org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
          org.apache.hadoop.mapred.lib.aggregate.TestAggregates
          org.apache.hadoop.ipc.TestMRCJCSocketFactory
          org.apache.hadoop.mapreduce.TestValueIterReset
          org.apache.hadoop.mapred.TestMRCJCFileInputFormat
          org.apache.hadoop.mapreduce.lib.aggregate.TestMapReduceAggregates
          org.apache.hadoop.mapred.TestReporter
          org.apache.hadoop.mapred.TestFileOutputFormat
          org.apache.hadoop.mapreduce.lib.chain.TestMapReduceChain
          org.apache.hadoop.mapred.TestReduceFetchFromPartialMem
          org.apache.hadoop.mapreduce.TestMapCollection
          org.apache.hadoop.mapreduce.TestLocalRunner
          org.apache.hadoop.mapreduce.lib.output.TestMRMultipleOutputs
          org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService
          org.apache.hadoop.mapred.TestMerge
          org.apache.hadoop.mapreduce.v2.TestMROldApiJobs
          org.apache.hadoop.mapred.TestCollect
          org.apache.hadoop.mapreduce.security.ssl.TestEncryptedShuffle
          org.apache.hadoop.mapreduce.v2.TestMiniMRProxyUser
          org.apache.hadoop.fs.TestDFSIO
          org.apache.hadoop.mapred.TestUserDefinedCounters
          org.apache.hadoop.mapreduce.TestNewCombinerGrouping
          org.apache.hadoop.conf.TestNoDefaultsJobConf
          org.apache.hadoop.mapred.TestJobName
          org.apache.hadoop.mapreduce.v2.TestMRAMWithNonNormalizedCapabilities
          org.apache.hadoop.mapreduce.v2.TestSpeculativeExecution
          org.apache.hadoop.mapred.TestLineRecordReaderJobs
          org.apache.hadoop.mapreduce.lib.chain.TestChainErrors
          org.apache.hadoop.mapred.lib.TestMultipleOutputs
          org.apache.hadoop.mapred.TestJobCleanup
          org.apache.hadoop.mapreduce.TestMROutputFormat
          org.apache.hadoop.mapred.TestMiniMRBringup
          org.apache.hadoop.mapred.TestOldCombinerGrouping
          org.apache.hadoop.mapred.TestClusterMapReduceTestCase
          org.apache.hadoop.mapreduce.TestChild
          org.apache.hadoop.mapreduce.lib.join.TestJoinDatamerge
          org.apache.hadoop.mapred.TestNetworkedJob
          org.apache.hadoop.mapred.TestMiniMRClientCluster
          org.apache.hadoop.mapred.TestClientRedirect
          org.apache.hadoop.mapreduce.v2.TestRMNMInfo
          org.apache.hadoop.mapred.TestJavaSerialization
          org.apache.hadoop.mapred.TestFieldSelection
          org.apache.hadoop.mapred.TestJobSysDirWithDFS
          org.apache.hadoop.mapreduce.lib.map.TestMultithreadedMapper
          org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
          org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA
          org.apache.hadoop.yarn.client.TestRMFailover
          org.apache.hadoop.yarn.client.api.impl.TestAMRMClient
          org.apache.hadoop.yarn.client.api.impl.TestNMClient
          org.apache.hadoop.yarn.client.TestGetGroups
          org.apache.hadoop.yarn.client.TestResourceManagerAdministrationProtocolPBClientImpl
          org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
          org.apache.hadoop.yarn.client.api.impl.TestYarnClient
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
          org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
          org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens
          org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
          org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
          org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
          org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs

          The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapreduce.lib.jobcontrol.TestMapReduceJobControl
          org.apache.hadoop.mapred.pipes.TestPipeApplication

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4475//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4475//console

          This message is automatically generated.

          Steve Loughran added a comment -

          I'm not going to comment on the current architecture; I think I need to understand the proposals better to understand what is being proposed for the first iteration. And iterations are, as others propose, a good way to do it.

          FWIW, the SLIDER-81 case is about allowing us to allocate parts of a YARN cluster explicitly to groups; having queues select labels should suffice there. Although you can get exclusive use of a node by asking for all its resources, that does not guarantee that a node will be free for your team (ignoring preemption).

          There's also a possible need in the future: label-based block placement. Can I label a set of nodes "hbase-production" and be confident that >1 node in that set will have a copy of the HBase data blocks? I don't think it's timely to address that today, but having the means to do so would be useful in the future. That argues for the HDFS layer being able to see/receive the same label data.

          the patch

          This is a major patch – it always hurts me to see how much coding we need to do to work with protobuf, as that's a major portion of the diff.

          1. too much duplication of "-showLabels" and "-refreshLabels" strings in the code. These should be made constants somewhere.
          2. why is {{ getClusterNodeLabels() }} catching YarnException and rethrowing as an IOE? Can't it just be added to the signature?
          3. version of net.java.dev.eval dependency must go into hadoop-project POM.
          4. Could you use SLF4J for the logging in new classes...we're slowly moving towards that everywhere
          5. Label manager should just be a service in its own right. If you do want to wrap it in LabelManagementService
            then can you (a) justify this and (b) match the full lifecycle
          6. I don't want yet another text file format for configuration. The label config should be either Hadoop XML or some JSON syntax. Why? It helps other tools parse and generate it.

          tests

          1. Tests assume that "/tmp/labelFile" is writeable; they should use "./target/labelFile" or something else under ./target
          2. use assertEquals in service state tests too
          3. why the sleep in setup? that adds 6 seconds/test method
          4. equalsIgnoreCase mustn't be used, go .toLower(LOCALE_EN).equals() for i18n.
          5. there's a lot of testing that could be factored into commonality (probes for configs files, assertContains on labels). This will simplify the tests
          6. we'll need tests that the schedulers work with labels, obviously.
          Yuliya Feldman added a comment -

          I am out of the country now with very poor internet connectivity, so I won't be able to answer comprehensively.
          To: Steve Loughran
          Really appreciate your comments.
          I definitely agree with the majority of the comments you made, especially about how much code it takes to add a single method to the rmadmin command - maybe we missed something, but it really is too much.
          Regarding the wrapper on top of LabelManager to behave as a service: in a real-life situation the service is instantiated once per process, which is exactly what we need, as it is really a singleton; but since unit tests create a service per test, that created issues with service states in this case.
          About waiting for 6 seconds between tests (to allow the labels file to reload) - this can be reduced further.

          Wangda Tan added a comment -

          We've been carefully thinking about our requirements, the discussion from the community, and the approach outlined in your design document. We have attempted to consolidate them and put them together into a new design doc. We believe this proposal addresses the core use cases listed for this feature and is also implementable in a phased, timely manner. I want to work with you to consolidate the proposed APIs to get this feature to completion soon. Please kindly review and comment on the design when you get a chance.

          Thanks,
          Wangda Tan

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12662291/Node-labels-Requirements-Design-doc-V2.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4645//console

          This message is automatically generated.

          Wangda Tan added a comment -

          Hi guys,
          Thanks for your input over the past several weeks. I implemented a patch based on the design doc https://issues.apache.org/jira/secure/attachment/12662291/Node-labels-Requirements-Design-doc-V2.pdf during the past two weeks. I'd really appreciate it if you could take a look. The patch is YARN-796.node-label.demo.patch.1 (I gave it a longer name so it won't be confused with other patches).

          Already included in this patch:

          • Protocol changes for ResourceRequest, ApplicationSubmissionContext (leveraged contribution from Yuliya's patch, thanks). also updated AMRMClient
          • RMAdmin changes to dynamically update labels of node (add/set/remove), also updated RMAdmin CLI
          • Capacity scheduler related changes including:
            • headroom calculation, preemption, and container allocation respect labels.
            • Allow users to set the list of labels a queue can access in capacity-scheduler.xml
          • A centralized node label manager that can be updated dynamically to add/set/remove labels and can persist labels to the file system. It works with the RM restart/HA scenario (similar to RMStateStore).
          • Support for a --labels option in the distributed shell, so the distributed shell can be used to test this feature
          • Related unit tests

          Will include later:

          • RM REST APIs for node label
          • Distributed configuration (set labels in yarn-site.xml of NMs)
          • Support labels in FairScheduler

          Try this patch
          1. Create a capacity-scheduler.xml with labels accessible on queues

             root
             /  \
            a    b
            |    |
            a1   b1
          
          a.capacity = 50, b.capacity = 50 
          a1.capacity = 100, b1.capacity = 100
          
          And a.label = red,blue; b.label = blue,green
          <property>
              <name>yarn.scheduler.capacity.root.a.labels</name>
              <value>red, blue</value>
          </property>
          
          <property>
              <name>yarn.scheduler.capacity.root.b.labels</name>
              <value>blue, green</value>
          </property>
          

          This means queue a (and its sub-queues) CAN access labels red and blue; queue b (and its sub-queues) CAN access labels blue and green

          2. Create a node-labels.json locally; this holds the initial labels on nodes (you can change them dynamically using the rmadmin CLI while the RM is running, so you don't have to use the file). Then set yarn.resourcemanager.labels.node-to-label-json.path to file:///path/to/node-labels.json

          {
             "host1":{
                 "labels":["red", "blue"]
             },
             "host2":{
                 "labels":["blue", "green"]
             }
          }
          

          This sets red/blue labels on host1, and sets blue/green labels on host2

          3. Start the YARN cluster (if you have several nodes in the cluster, you need to launch HDFS to use the distributed shell)

          • Submit a distributed shell:
            hadoop jar path/to/*distributedshell*.jar org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command hostname -jar path/to/*distributedshell*.jar -num_containers 10 -labels "red && blue" -queue a1
            

            This will run a distributed shell that launches 10 containers; the command run is "hostname" and the requested label expression is "red && blue", so all containers will be allocated on host1.

          Some other examples:

          • -queue a1 -labels "red && green", this will be rejected, because queue a1 cannot access label green
          • -queue a1 -labels "blue", some containers will be allocated on host1, and some others will be allocated to host2, because both of host1/host2 contain "blue" label
          • -queue b1 -labels "green", all containers will be allocated on host2

          4. Dynamically update labels using rmadmin CLI

          // dynamically add labels x, y to label manager
          yarn rmadmin -addLabels x,y
          
          // dynamically set label x on node1, and labels x and y on node2
          yarn rmadmin -setNodeToLabels "node1:x;node2:x,y"
          
          // remove labels from label manager, and also remove labels on nodes
          yarn rmadmin -removeLabels x
          

          Two more examples of node labels
          1. Labels as constraints:

          Queue structure:
              root
             / | \
            a  b  c
          
          a has label: WINDOWS, LINUX, GPU
          b has label: WINDOWS, LINUX, LARGE_MEM
          c doesn't have label
          
          25 nodes in the cluster:
          h1-h5:   LINUX, GPU
          h6-h10:  LINUX,
          h11-h15: LARGE_MEM, LINUX
          h16-h20: LARGE_MEM, WINDOWS
          h21-h25: <empty>
          

          If you want "LINUX && GPU" resource, you should submit to queue-a, and set label in Resource Request to "LINUX && GPU"
          If you want "LARGE_MEM" resource, and don't mind its OS, you can submit to queue-b, and set label in Resource Request to "LARGE_MEM"
          If you want to allocate on nodes don't have labels (h21-h25), you can submit it to any queue, and leave label in Resource Request empty
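
          For instance, a minimal sketch reusing the distributed-shell invocation from step 3 above, with the queue and label names from this example and an arbitrary container count:

          # ask for containers constrained to nodes carrying both LINUX and GPU labels
          hadoop jar path/to/*distributedshell*.jar org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command hostname -jar path/to/*distributedshell*.jar -num_containers 4 -labels "LINUX && GPU" -queue a

          With this setup the containers should only land on h1-h5, since those are the only nodes carrying both labels.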

          2. Labels to hard partition cluster

          Queue structure:
              root
             / | \
            a  b  c
          
          a has label: MARKETING
          b has label: HR
          c has label: RD
          
          15 nodes in the cluster:
          h1-h5:   MARKETING
          h6-h10:  HR
          h11-h15: RD
          

          Now the cluster is hard-partitioned into 3 small clusters: h1-h5 are for marketing and only queue-a can use them, so you submit to queue-a and set the label in the ResourceRequest to "MARKETING". The HR/RD partitions work the same way.
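
          As a rough sketch of how this partition could be wired up with the rmadmin commands from step 4 above (the host and label names are just the ones used in this example; the JSON file from step 2 would work equally well):

          # register the partition labels with the label manager
          yarn rmadmin -addLabels MARKETING,HR,RD
          # assign one label per node, using the "host:label" syntax shown in step 4
          yarn rmadmin -setNodeToLabels "h1:MARKETING;h2:MARKETING;h3:MARKETING;h4:MARKETING;h5:MARKETING;h6:HR;h7:HR;h8:HR;h9:HR;h10:HR;h11:RD;h12:RD;h13:RD;h14:RD;h15:RD"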

          I'd appreciate your feedback on this patch - do you think this is the correct direction? If you think it's fine, I will break the patch down into several smaller patches and create some sub-JIRAs for easier review.

          Thanks,
          Wangda Tan

          Allen Wittenauer added a comment -

          I might have missed it, but I don't see dynamic labels generated from an admin-provided script or class on the NM listed above. That's a must-have feature to make this viable for any large installation.

          Wangda Tan added a comment -

          Hi Allen,

          I don't see dynamic labels generated from an admin provided script or class on the NM listed above

          If it means "set labels on yarn-site.xml in each NM, and NM will report such labels to RM". It should be a part of Distributed configuration (set labels in yarn-site.xml of NMs) in the TODO list.
          If it's not, could you please give me more details about what is "dynamic labels generated from an admin on the NM" in your thinking

          Allen Wittenauer added a comment -

          set labels on yarn-site.xml in each NM, and NM will report such labels to RM

          This breaks configuration management; changing the yarn-site.xml on a per-node basis means ops folks will lose the ability to use system tools to verify the file's integrity (e.g., rpm -V).

          If it's not, could you please give me more details about what is "dynamic labels generated from an admin on the NM" in your thinking

          As I've said before, I basically want something similar to the health-check code: I provide something executable that the NM can run at runtime that will provide the list of labels. If we need to add labels, we just update the script, which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere.
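
          To make that concrete, here is a minimal sketch of what such a node-label script might look like. The detection rules, the label names, and the output format (a comma-separated list on stdout) are purely illustrative assumptions, not part of any existing YARN contract:

          #!/usr/bin/env bash
          # Illustrative node-label probe, similar in spirit to the NM health-check script:
          # prints a comma-separated list of labels derived from locally detectable facts.
          labels=()

          # OS / kernel family (e.g. "linux")
          labels+=("$(uname -s | tr '[:upper:]' '[:lower:]')")

          # Large-memory nodes: more than ~96 GB of RAM (the threshold is arbitrary here)
          mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
          if [ "${mem_kb:-0}" -gt $((96 * 1024 * 1024)) ]; then
            labels+=("LARGE_MEM")
          fi

          # GPU present? (assumes NVIDIA tooling is installed; purely an example heuristic)
          if command -v nvidia-smi >/dev/null 2>&1; then
            labels+=("GPU")
          fi

          # Emit the labels for the NM to pick up and report to the RM
          (IFS=,; echo "${labels[*]}")

          The NM would run something like this periodically and report the output, much as the health checker reports node health today.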

          Wangda Tan added a comment -

          As I've said before, I basically want something similar to the health check code: I provide something executable that the NM can run at runtime that will provide the list of labels. If we need to add labels, it's updating the script which is a much smaller footprint than redeploying HADOOP_CONF_DIR everywhere.

          I understand now; it makes sense, since it's a flexible way for admins to set labels on the NM side. Adding a NodeLabelCheckerService to the NM, similar to NodeHealthCheckerService, should work. I'll create a separate JIRA for setting labels on the NM side under this ticket and leave the design/implementation discussion here.

          Viplav Madasu added a comment -

          Hi Wangda,
          I was looking into your patch and noticed a bug in the remove-labels processing: labels are not actually being removed. You can verify this simply by removing a label twice - you don't get an error (a quick reproduction sketch follows below). The code that causes this issue is in the YARN client CLI code: line 318 of removeLabels in RMAdminCLI.java should be
          labels.add(p);
          instead of
          labels.remove(p);
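
          A minimal check using the rmadmin commands from the demo patch; the exact behaviour of the second removal is an assumption about the intended semantics:

          yarn rmadmin -addLabels x
          yarn rmadmin -removeLabels x
          # with the fix, a second removal should complain that "x" no longer exists;
          # with the bug, both removals appear to succeed and the label is never removed
          yarn rmadmin -removeLabels x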

          Regards,
          Viplav

          Wangda Tan added a comment -

          Hi Viplav Madasu
          Thanks a lot for reviewing the patch and pointing this out. That patch is a little out of date; I've already noticed and fixed this issue, and I've attached the latest patch, named "YARN-796.node-label.consolidate.1.patch".

          And I'm working on splitting this big patch into pieces; I will update this JIRA.
          Wangda

          Wangda Tan added a comment -

          Attached latest consolidated patch named "YARN-796.node-label.consolidate.1.patch"

          Wangda Tan added a comment -

          Hi guys,
          I've just created a shadow umbrella JIRA (YARN-2492) for YARN-796 (because YARN-796 is a sub JIRA), and created a bunch of sub tasks under it.
          I also updated a diagram for YARN-796 (YARN-796-Diagram.pdf); hopefully it gives you a better understanding of the overall structure and flow.

          Summary of sub tasks:

          - User API changes: YARN-2493
          - NodeLabelManager implementation: YARN-2494 (depends on YARN-2493)
          - CapacityScheduler side changes: YARN-2496 (depends on YARN-2494)
          - Respect labels when doing preemption in CS: YARN-2498 (depends on YARN-2496)
          - Other changes in RM to support labels: YARN-2500
          - Changes in AMRMClient to support labels: YARN-2501 (depends on YARN-2493)
          - Changes in Distributed Shell to support labels: YARN-2502 (depends on YARN-2501)
          - WebUI/RMAdmin-CLI/REST-API: YARN-2503, YARN-2504, YARN-2505 (depends on YARN-2494, YARN-2496, YARN-2500)

          Yuliya Feldman, do you agree with the basic proposal? I think the user-facing APIs of our proposals are similar, but they differ in implementation.

          Please feel free to add your comments. Thanks a lot!
          Wangda

          Yuliya Feldman added a comment -

          Tan, Wangda, it is a great idea to split - otherwise it is getting too big and hard to keep track of. If you feel like assigning some JIRAs to me, feel free, though I guess you are ready to roll.

          Wangda Tan added a comment -

          Hi Yuliya Feldman,
          Thanks for your support on this,
          I've assigned some JIRAs to myself because I already have patches for them (they're parts of the big patch I uploaded today); I'm just waiting for some of the earlier JIRAs to get committed, and then I'll split and upload.

          Please feel free to assign any unassigned JIRAs to yourself if you're interested. And I believe there are more tasks/improvements we can do for YARN-796 - please create new tasks if you have any ideas.

          Many thanks,
          Wangda

          Yuliya Feldman added a comment -

          Tan, Wangda Yep - there are still 3 unassigned JIRAs out of 13 as of this moment.
          Please assign me YARN-2497.

          Wangda Tan added a comment -

          Yuliya Feldman, I just asked Zhijie Shen to add you to the contributor list, and I've assigned it to you.
          Thanks,

          Yuliya Feldman added a comment -

          Tan, Wangda. OK

          Craig Welch added a comment -

          This is a bit of a detail, but the current version of the code lowercases the node labels rather than respecting the given name. I don't believe this is what we want. The requirements do request case-insensitive comparison, but that is not the same as changing the case. There are a few options which come to mind:

          1. Switch to case-insensitive Sets and Maps for managing the labels - TreeSet and TreeMap can be configured to operate in a case-insensitive fashion, and I expect they would be fine to use for node labels (a minimal sketch follows below).
          2. Gate label names on the way in to force a consistent case while preserving case - a Map with a lowercase key and original-case value could be used to keep all labels for a given sequence of letters in one consistent case (the original).
          3. Drop the requirement for case insensitivity - I'm not sure of the reasoning; I assume it is to prevent mis-types, but I'm not sure it's really so important. There are still many other opportunities for mistyping labels, and I'm not sure protecting against this one case is worth the implementation cost/complexity or the loss of the original case as specified by the user.

          I suggest 3, FWIW
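
          For reference, a minimal sketch of option 1 (standard JDK only; nothing here is from the patch) - a TreeSet built with String.CASE_INSENSITIVE_ORDER compares labels case-insensitively while preserving the case of the first insertion:

          import java.util.Set;
          import java.util.TreeSet;

          public class CaseInsensitiveLabelSet {
            public static void main(String[] args) {
              // "GPU" and "gpu" are treated as the same element; the stored
              // spelling is whichever was added first.
              Set<String> labels = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
              labels.add("GPU");
              System.out.println(labels.add("gpu"));      // false - already present
              System.out.println(labels.contains("Gpu")); // true
              System.out.println(labels);                 // [GPU]
            }
          }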

          Allen Wittenauer added a comment -

          Agreed on option 3. Good catch!

          Wangda Tan added a comment -

          Hi Craig Welch and Allen Wittenauer,
          I agree with #3 as well; the original motivation was to avoid case typos from users. But looking at other existing YARN configs, like CS queue names, a different case in a queue name means a different queue. I prefer to drop the requirement if there's no strong opinion otherwise.

          Thanks,
          Wangda

          Craig Welch added a comment -

          So, I'm adding code to check whether a user should be able to modify labels (i.e., is an admin), and I think we should check the UserGroupInformation but not execute the operation using "doAs". Ultimately, the process is writing data into HDFS, and for permission reasons I think it should always be written as the same user - the user YARN runs as. If we use doAs, there will be a mishmash of users there, and to keep the directory secure there would need to be a group containing all the admin users with rights on it, which is extra overhead (otherwise it has to be world-writable, which tends to compromise the security model...).

          I think the same is true if we use other datastores down the line for holding the label info. Really, our interest in the user is to verify access; we don't need or want to perform actions on their behalf (like you would when launching a job, etc.) - this is not one of those cases.

          So, I propose enforcing the check but executing whatever changes as the user the process is running under (the resourcemanager/yarn user, basically - just dropping the doAs). This means that entry points will need to do the verification, but that's not really an issue: they already have to gather the info about who the user is and be aware of the need for doAs today. It also means the user will need to be careful, if executing a tool which directly modifies the data in HDFS, to do that as an appropriate user - but they already have to do that, so it's not a new issue created by this approach (it doesn't really make that any better or worse, imho). Thoughts?
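
          For illustration, a minimal sketch of the kind of check being described, assuming the admin ACL comes from yarn.admin.acl; the class and method are hypothetical, and in a real RPC handler the caller's UGI would come from the RPC layer rather than being passed in like this:

          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.security.UserGroupInformation;
          import org.apache.hadoop.security.authorize.AccessControlList;
          import org.apache.hadoop.yarn.conf.YarnConfiguration;

          public class NodeLabelAdminCheck {
            // Hypothetical: verify the caller is in the YARN admin ACL; the actual
            // write to the label store then runs as the RM's own user (no doAs).
            public static void checkAdminAccess(Configuration conf,
                UserGroupInformation caller) throws IOException {
              AccessControlList adminAcl = new AccessControlList(
                  conf.get(YarnConfiguration.YARN_ADMIN_ACL,
                      YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
              if (!adminAcl.isUserAllowed(caller)) {
                throw new IOException("User " + caller.getShortUserName()
                    + " is not authorized to modify node labels");
              }
              // ...store update happens here, executed as the user the RM runs as
              // (typically "yarn"), not wrapped in caller.doAs(...).
            }
          }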

          Wangda Tan added a comment -

          Hi Craig,
          I think when the RM is running, the solution should be exactly as you described: we only check whether the caller is a user on the admin list, and the RM writes the file itself - by default as the "yarn" user.
          But when the RM is not running and we need to execute a tool to directly modify data in the store, we cannot use this approach, because the ACL is retrieved from the local configuration file: a malicious user could create a configuration that declares itself an admin user and launch the tool with that configuration.
          IMHO, we don't need to check the ACL when running a standalone tool. The tool modifies the file, and the file's directory already has permissions set (e.g. it belongs to the yarn user), so HDFS will do the check for us. But we should only run such a standalone command as the same user that launches the RM.

          Thanks,
          Wangda

          Craig Welch added a comment -

          Good - what you describe wrt the CLI is what I was trying to describe; I just might not have been very clear about it. I'm going to go ahead, then, and make the changes on the service side to match what we've described.

          Wangda Tan added a comment -

          Attached an updated consolidated patch, named "YARN-796.node-label.consolidate.2.patch". It contains several bug fixes and supports admins changing node labels when the RM is not running.

          Please feel free to try and review.

          Thanks,
          Wangda

          Wangda Tan added a comment -

          Uploaded a new consolidated patch against the latest trunk for you to play with.

          Wangda Tan added a comment -

          Split and updated all existing patches for YARN-796 against the latest trunk. Patch dependencies:

                YARN-2493;YARN-2544
                        |          \
                     YARN-2494   YARN-2501;YARN-2502
                        |
                     YARN-2500
                        |
                       YARN-2596
                     /     |     \
            YARN-2598  YARN-2504 YARN-2505
                                     |
                                 YARN-2503
          

          Please kindly review.

          Thanks,
          Wangda

          Wangda Tan added a comment -

          Attached ver.6 consolidated patch against latest trunk – "YARN-796.node-label.consolidate.6.patch"

          Wangda Tan added a comment -

          Updated the latest consolidated patch against trunk. Set Patch Available to kick Jenkins.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12669919/YARN-796.node-label.consolidate.7.patch
          against trunk revision 6fe5c6b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 31 new or modified test files.

          -1 javac. The applied patch generated 1294 javac compiler warnings (more than the trunk's current 1266 warnings).

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5039//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5039//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-client.html
          Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5039//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5039//console

          This message is automatically generated.

          Wangda Tan added a comment -

          Attached a new patch that fixes the javac warnings, findbugs warnings, and test failures.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12669943/YARN-796.node-label.consolidate.8.patch
          against trunk revision 6fe5c6b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 37 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5040//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5040//console

          This message is automatically generated.

          Wangda Tan added a comment -

          The failure should be unrelated to these changes; I found it failing in a recent JIRA as well: https://issues.apache.org/jira/browse/YARN-611?focusedCommentId=14129761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14129761. Filed MAPREDUCE-6098.

          Craig Welch added a comment -

          It looks like the FileSystemNodeLabelManager will just append changes to the edit log forever, until it is restarted - is that correct? If so, a long-running cluster with lots of changes could end up with a rather large edit log. I think that every so many writes (N writes) a recovery should be "forced" to clean up the edit log and consolidate state (do a recover...)
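
          For illustration, one hedged sketch of that idea (hypothetical names; not based on the actual FileSystemNodeLabelManager code) - every N writes, rewrite a consolidated mirror file and start a fresh edit log so it cannot grow without bound between restarts:

          import java.io.IOException;

          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class NodeLabelStoreCompactor {
            private static final long COMPACTION_INTERVAL = 100; // hypothetical "N"

            private final FileSystem fs;
            private final Path mirror;   // consolidated snapshot of all labels
            private final Path editLog;  // append-only log of changes since the snapshot
            private long writesSinceCompaction = 0;

            public NodeLabelStoreCompactor(FileSystem fs, Path storeDir) {
              this.fs = fs;
              this.mirror = new Path(storeDir, "nodelabels.mirror");
              this.editLog = new Path(storeDir, "nodelabels.editlog");
            }

            // Called after each label mutation has been appended to the edit log.
            void maybeCompact(byte[] serializedFullState) throws IOException {
              if (++writesSinceCompaction < COMPACTION_INTERVAL) {
                return;
              }
              // Write a fresh snapshot of the full label state, then drop the old
              // edit log; recovery only needs the mirror plus any later edits.
              try (FSDataOutputStream out = fs.create(mirror, true)) {
                out.write(serializedFullState);
              }
              fs.delete(editLog, false);
              writesSinceCompaction = 0;
            }
          }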

          Wangda Tan added a comment -

          Had an offline discussion with Craig Welch today, based on Craig's comment on YARN-2496: https://issues.apache.org/jira/browse/YARN-2496?focusedCommentId=14143993&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14143993. I think it's better to put it here for more discussion.

          A simple summary of the problem is:
          Queues and nodes now have labels, and a queue may not be able to access all nodes in the cluster, so the headroom might be less than the headroom calculated today.
          Today in YARN-2496, the headroom calculation is changed to headroom = min(headroom, total-resource-the-queue-can-access).
          However, this may not be enough: an application may specify the labels it requires (e.g. label-expression = GPU && LARGE_MEMORY). It's better to return headroom according to the application's label expression, to avoid resource deadlocks and similar problems.
          We have two problems to solve to support this:

          1. There can be thousands of combinations of label expressions; calculating headroom for all of them would be very expensive when many applications are running and asking for different labels at the same time.
          2. A single application can ask for different label expressions for different containers (e.g. mappers need GPU but reducers don't), so a single headroom value returned in AllocateResponse may not be enough.

          Proposed solutions:
          Solution #1:
          Assume a relatively small number of unique label-expressions can satisfy most applications. We can add an option in capacity-scheduler.xml where users list the label-expressions that need to be pre-calculated; the number of such label-expressions should be small (e.g. <= 100 in the whole cluster). NodeLabelManager will update them when a node joins, leaves, or has its labels changed.
          We would also add a new field in AllocateResponse, like Map<LabelExpression(String), Headroom(Resource)> labelExpToHeadroom. We return the list of pre-calculated headrooms to the AM, and the AM can decide how to use it.

          Solution #2:
          The AM will receive updated nodes (a list of NodeReport) from the RM in AllocateResponse, and the AM itself can figure out the headroom for a specified label-expression from the updated NMs (a minimal sketch follows below). This is simpler than #1, but the AM side needs to implement its own logic to support it.
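
          To illustrate Solution #2, a minimal AM-side sketch (hypothetical helper; it assumes NodeReport exposes the node's labels via a getNodeLabels() accessor as proposed here, which may differ in the final API, and it ignores queue capacity/user limits that real headroom accounts for):

          import java.util.List;

          import org.apache.hadoop.yarn.api.records.NodeReport;
          import org.apache.hadoop.yarn.api.records.Resource;
          import org.apache.hadoop.yarn.util.resource.Resources;

          public class LabelHeadroomEstimator {
            // Sum the free capacity of all nodes carrying the given label, based on
            // the NodeReports returned to the AM in AllocateResponse.
            public static Resource headroomForLabel(List<NodeReport> nodes, String label) {
              Resource headroom = Resources.createResource(0, 0);
              for (NodeReport node : nodes) {
                if (node.getNodeLabels() != null && node.getNodeLabels().contains(label)) {
                  Resource free = Resources.subtract(node.getCapability(), node.getUsed());
                  Resources.addTo(headroom, free);
                }
              }
              return headroom;
            }
          }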

          Hope to get more thoughts about this,

          Thanks,
          Wangda

          Craig Welch added a comment -

          Some additional info regarding the headroom problem - one of the prototypical node label cases is a queue which can access the whole cluster but which also can access a particular label ("a"). A mapreduce job is launched on this queue with an expression limiting it to "a" nodes. It will receive headroom reflecting access to the whole cluster, even though it can only use "a" nodes. This will sometimes result in a deadlock situation where it starts reducers before it should, based on the incorrect (inflated) headroom, and then cannot start mappers in order to complete the map phase, and so is deadlocked. If there are significantly fewer "a" nodes than the total cluster (expected to be a frequent case), during cases of high or full utilization of those nodes (again, desirable and probably typical), this deadlock will occur.

          It is possible to make no change and receive the correct headroom value for a very restricted set of configurations. If queues are restricted to a single label (and not * or "also the whole cluster"), and jobs run with a label expression selecting that single label, they should get the correct headroom values. Unfortunately, this eliminates a great many use cases/cluster configurations, including the one above, which I think is very important to support.

          A couple of additional details regarding Solution 1 above - in addition to the potential to expand the allocate response API to include a map of expression->headroom values, it is also possible with this approach to return the correct headroom value where it is currently returned for a job with a single expression. So, in a scenario I think very likely - the first use case above (a queue which can see the whole cluster plus a label with "special" nodes, say label "GPU"), with a default label expression of "GPU" used by the job throughout, running an unmodified mapreduce job (or hive, etc.) where no special support for labels has been added to that component in the platform - the correct headroom will be returned. I think it's important to be able to introduce node label usability in a largely backward-compatible way, so that mapreduce and the things above it can make use of node labels with just configuration and the YARN platform implementation, and this is the solution (of the ones we've considered) which makes that possible.

          Wangda Tan added a comment -

          Uploaded ver.10 patch

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673025/YARN-796.node-label.consolidate.10.patch
          against trunk revision 16333b4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 40 new or modified test files.

          -1 javac. The applied patch generated 1268 javac compiler warnings (more than the trunk's current 1267 warnings).

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 16 new Findbugs (version 2.0.3) warnings.

          -1 release audit. The applied patch generated 1 release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.pipes.TestPipeApplication
          org.apache.hadoop.yarn.api.TestPBImplRecords
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservationQueue
          org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityReservationSystem
          org.apache.hadoop.yarn.server.resourcemanager.reservation.TestNoOverCommitPolicy
          org.apache.hadoop.yarn.server.resourcemanager.reservation.TestGreedyReservationAgent
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
          org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
          org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy

          The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapreduce.TestLargeSort
          org.apache.hadoop.yarn.client.TestResourceTrackerOnHA
          org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
          org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
          org.apache.hadoop.yarn.server.resourcemanager.TestRMHA

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5268//testReport/
          Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5268//artifact/patchprocess/patchReleaseAuditProblems.txt
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5268//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5268//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5268//artifact/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5268//console

          This message is automatically generated.

          Wangda Tan added a comment -

          Attached ver.11, which fixes the javac warnings, findbugs warnings, and test failures.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673059/YARN-796.node-label.consolidate.11.patch
          against trunk revision 16333b4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 42 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.pipes.TestPipeApplication
          org.apache.hadoop.yarn.api.TestPBImplRecords
          org.apache.hadoop.yarn.nodelabels.TestFileSystemNodeLabelsStore
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservationQueue
          org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5272//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5272//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5272//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673185/YARN-796.node-label.consolidate.12.patch
          against trunk revision 3affad9.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 41 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.pipes.TestPipeApplication
          org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5282//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5282//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5282//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673284/YARN-796.node-label.consolidate.13.patch
          against trunk revision 519e5a7.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5298//console

          This message is automatically generated.

          Wangda Tan added a comment -

          Updated to trunk

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673289/YARN-796.node-label.consolidate.13.patch
          against trunk revision 0fb2735.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 42 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.pipes.TestPipeApplication
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
          org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5299//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5299//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5299//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5299//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673374/YARN-796.node-label.consolidate.14.patch
          against trunk revision 2e789eb.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 42 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.mapred.pipes.TestPipeApplication
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
          org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5307//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5307//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5307//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5307//console

          This message is automatically generated.

          Kannan Rajah added a comment -

          Wangda Tan I think we can improve the performance of the load balancing logic in FairScheduler.continuousSchedulingAttempt when Label Based Scheduling is active. I would like to get your input on this. If you believe this is a valid improvement, I would like to work on a proposal and fix. Here is an overview of the current logic.

          for each node (ordered by cap remaining)
            for each schedulable (ordered by fairness)
              if a set of conditions are met
                assign the container to node
          

          Problem:
          When LBS is enabled, the set of conditions includes the label match. The node with the most remaining capacity may not meet the label criteria, so there is no reason to iterate over the global set of nodes when only a subset of them can even be used to schedule some applications. The effect could be profound in a large cluster with non-overlapping node labels. What we really need is to track a set of "sub-clusters" and the applications that can be scheduled on them. Within each sub-cluster, we maintain the node ordering by remaining capacity so that tasks are evenly distributed across nodes.

          for each subcluster
            if there are no applications belonging to it
              continue
          
            for each node in the subcluster (ordered by cap remaining)
              for each schedulable (ordered by fairness)
                if a set of conditions are met
                  assign the container to node
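
          To make the idea concrete, here is a minimal, illustrative sketch of the "sub-cluster" bookkeeping described above. Class and method names are hypothetical and not taken from any existing patch: nodes are bucketed by their label set, and the scheduler only sorts and iterates the buckets that can satisfy a requested label.

          import java.util.*;

          public class SubClusterIndex {
            // Minimal stand-in for a scheduler node; real code would use SchedulerNode.
            static class Node {
              final String name;
              final Set<String> labels;   // labels on this node (assumed immutable)
              final int availableMB;      // remaining capacity used only for ordering
              Node(String name, Set<String> labels, int availableMB) {
                this.name = name; this.labels = labels; this.availableMB = availableMB;
              }
            }

            // label set -> nodes carrying that label set, ordered by remaining capacity (descending)
            private final Map<Set<String>, TreeSet<Node>> subClusters = new HashMap<>();

            void addNode(Node n) {
              subClusters.computeIfAbsent(n.labels, k -> new TreeSet<>(
                  Comparator.comparingInt((Node x) -> -x.availableMB)
                            .thenComparing(x -> x.name)))
                  .add(n);
            }

            // Only visit sub-clusters whose label set can satisfy the requested label,
            // instead of iterating every node in the cluster.
            List<Node> candidateNodes(String requestedLabel) {
              List<Node> result = new ArrayList<>();
              for (Map.Entry<Set<String>, TreeSet<Node>> e : subClusters.entrySet()) {
                if (requestedLabel == null || requestedLabel.isEmpty()
                    || e.getKey().contains(requestedLabel)) {
                  result.addAll(e.getValue());
                }
              }
              return result;
            }
          }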
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12673374/YARN-796.node-label.consolidate.14.patch
          against trunk revision ca3381d.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6184//console

          This message is automatically generated.

          Wangda Tan added a comment -

          Kannan Rajah,
          Did you mean the capacity scheduler? Node-label support in the fair scheduler is still open (YARN-2497). Or do you want to do this as part of the YARN-2497 effort? I understand your proposal, and it is certainly a valid enhancement, but the trade-off is added complexity in the scheduling logic (both the fair and capacity scheduler implementations are already complex), so we need to figure out whether it is necessary.

          If you plan to do it, I can help with implementation discussion and review.

          Thanks,
          Wangda

          Kannan Rajah added a comment -

          No, I did mean the fair scheduler, because that is the one with the load-balancing logic. It does this by sorting nodes by available capacity and iterating over them one at a time. With label-based scheduling, the node with the most available capacity may not be eligible for the job. I agree that it adds complexity; in fact, we would need to add an API to the Scheduler interface that does not take a Node as input. I will draft a proposal and run it by you to see if it makes sense. I will also check with Yuliya about YARN-2497 and see whether this can be done as part of that.

          Wangda Tan added a comment -

          I'm not sure what the purpose of adding a new API to the Scheduler interface would be. Since this proposal is a specific enhancement rather than a global design discussion, I suggest filing a ticket under YARN-2492, and we can move the discussion to the new JIRA.

          Thanks,
          Wangda

          Wangda Tan added a comment -

          Attached the same design doc used for YARN-3214 (non-exclusive node labels) to this umbrella ticket.

          Jian Fang added a comment -

          Coming back to this issue since I am trying to merge the latest YARN-796 work into our Hadoop code base. One thing seems to be missing: how to specify labels for application masters. The application master is special in that it is the task manager of a specific YARN application, and it has additional placement requirements on a Hadoop cluster running in the cloud. For example, on Amazon EC2 we do not want application masters to be launched on spot instances when both spot and on-demand instances are available. YARN-796 should provide a mechanism to achieve this.

          Wangda Tan added a comment -

          Jian Fang,
          The patch attached to this JIRA is stale; instead, you should merge the patches under YARN-2492.

          For more usage information, take a look at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.0/YARN_RM_v22/node_labels/index.html#Item1.1. Specific to your question, there are now four ways to specify labels for applications (CapacityScheduler only for now):
          1) Specify default-node-label-expression on a queue; all containers under that queue will be assigned to the specified label
          2) Specify ApplicationSubmissionContext.appLabelExpression; all containers of the application will be assigned to the specified label
          3) Specify ApplicationSubmissionContext.amContainerLabelExpression; the AM container will be assigned to the specified label
          4) Specify ResourceRequest.nodeLabelExpression; individual containers will be assigned to the specified label

          Let me know if you have more questions.
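
          For illustration only, here is a minimal client-side sketch of options 2) to 4), assuming the ApplicationSubmissionContext and ResourceRequest setters available in releases that include node-label support (exact method names may vary between releases; the labels "on_demand" and "spot" are example values, not defaults):

          import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
          import org.apache.hadoop.yarn.api.records.Priority;
          import org.apache.hadoop.yarn.api.records.Resource;
          import org.apache.hadoop.yarn.api.records.ResourceRequest;
          import org.apache.hadoop.yarn.util.Records;

          public class NodeLabelRequestSketch {

            // 2) and 3): label the whole application and the AM container.
            static void applyLabels(ApplicationSubmissionContext ctx) {
              // All containers of the app default to nodes labeled "on_demand".
              ctx.setNodeLabelExpression("on_demand");

              // The AM container is labeled through its resource request.
              ResourceRequest amReq = Records.newRecord(ResourceRequest.class);
              amReq.setResourceName(ResourceRequest.ANY);
              amReq.setPriority(Priority.newInstance(0));
              amReq.setCapability(Resource.newInstance(1024, 1));
              amReq.setNumContainers(1);
              amReq.setNodeLabelExpression("on_demand");
              ctx.setAMContainerResourceRequest(amReq);
            }

            // 4): label an individual container request made by the AM.
            static ResourceRequest taskRequest() {
              ResourceRequest req = Records.newRecord(ResourceRequest.class);
              req.setResourceName(ResourceRequest.ANY);
              req.setPriority(Priority.newInstance(1));
              req.setCapability(Resource.newInstance(2048, 1));
              req.setNumContainers(4);
              req.setNodeLabelExpression("spot");
              return req;
            }
          }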

          Jian Fang added a comment -

          Thanks. It seems ApplicationSubmissionContext.amContainerLabelExpression is what I am looking for; I will try it and see if it works. Are there any plans for the fair scheduler? We need that as well.

          Wangda Tan added a comment -

          Fair scheduler efforts are tracked in YARN-2497; you can check the plans in that JIRA.

          Thanks,

          Jian Fang added a comment -

          I took a look at ApplicationSubmissionContext.amContainerLabelExpression and am not sure I understand the logic correctly. It seems amContainerLabelExpression is only set in RMWebServices, not by the RPC client. How would this value be populated for a regular MapReduce job?

          Furthermore, as a Hadoop service provider, people may want a mechanism to hook in a global label expression for all MR jobs. For example, on EC2 we don't want to launch AMs on any spot instances. It is not a good idea to ask individual users to configure this from their own job clients; instead, it is preferable for the Hadoop platform provider to configure it within Hadoop itself.

          Wangda Tan added a comment -

          Jian Fang,
          To clarify, MR does not currently support specifying node labels at job submission. Instead, you can configure default-node-label-expression on the queues that run the MR jobs, so that all containers allocated in those queues are placed on nodes with the specified labels. I'm not sure how you plan to manage MR jobs in an EC2 cluster. Will all MR jobs run in a fixed set of queues? If so, you can configure default-node-label-expression on those queues to get what you want.
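
          As an illustration only, a queue-level default could look like the following capacity-scheduler.xml fragment; the queue name "prod" and the label "on_demand" are examples, and the exact property names should be verified against the documentation for the release in use:

          <property>
            <name>yarn.scheduler.capacity.root.prod.accessible-node-labels</name>
            <value>on_demand</value>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.prod.default-node-label-expression</name>
            <value>on_demand</value>
          </property>
          <property>
            <name>yarn.scheduler.capacity.root.prod.accessible-node-labels.on_demand.capacity</name>
            <value>100</value>
          </property>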

          Also, if default-node-label-expression is not enough, you can file a ticket under MAPREDUCE to support specifying labels for MR jobs, and we can continue the discussion there.

          Jian Fang added a comment -

          Thanks, Wangda, for the clarification. Unfortunately, the queue configuration file is controlled by users, not the Hadoop platform provider, so we still need a mechanism in MapReduce to pass in node label expressions. I will file a ticket under MAPREDUCE. Thanks again.

          Jian Fang added a comment -

          JIRA MAPREDUCE-6304 has been created for this purpose.


            People

            • Assignee:
              Wangda Tan
            • Reporter:
              Arun C Murthy
            • Votes:
              6
            • Watchers:
              79
