Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

      Description

      Currently, the resource manager supports only memory as a resource during container allocation.

      I propose the following improvements:

      1. Add support for CPU utilization. Node CPU usage information can be obtained via ResourceCalculatorPlugin.

      2. Add support for custom resources. The node configuration would contain something like:

      name=node.resource.GPU, value=1 (the node has 1 GPU).

      If a job needs a GPU for computation, it adds a "GPU=1" requirement to its job config, and the Resource Manager allocates a container on a node with a GPU available.
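
      A minimal sketch of reading such a node-level setting with the standard org.apache.hadoop.conf.Configuration API; note that node.resource.GPU is this proposal's hypothetical example key, not an existing Hadoop property:

      import org.apache.hadoop.conf.Configuration;

      public class NodeResourceExample {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Hypothetical key from this proposal: the node has 1 GPU.
          conf.set("node.resource.GPU", "1");
          int gpus = conf.getInt("node.resource.GPU", 0); // default: no GPU
          System.out.println("GPUs available on this node: " + gpus);
        }
      }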

        Issue Links

          Activity

          Radim Kolar added a comment -

          Two kinds of resources:

          1. Cluster-shared resources - every node in the cluster can use them, for example WAN bandwidth.

          2. Node-local resources - for example a GPU card.

          These resources will be configured using the standard Hadoop configuration framework.

          mapreduce.resources.cluster.available=resourcename:float,... (this option will be used by the resourcemanager).
          mapreduce.resources.node.available=resourcename:float (for use by the nodemanager). The node manager will report available resources to the resourcemanager.

          job config:
          mapreduce.resources.mapper= list of resources needed by one mapper
          mapreduce.resources.reducer= list of resources needed by one reducer

          If the same resource is configured both per node and per cluster, the per-node version is preferred.
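
          A minimal sketch of parsing the proposed (not yet existing) comma-separated resourcename:float lists above; the class name is illustrative:

          import java.util.HashMap;
          import java.util.Map;

          public class ResourceListParser {
            // Parse e.g. "gpu:1.0,wan:0.5" into {gpu=1.0, wan=0.5}.
            public static Map<String, Float> parse(String value) {
              Map<String, Float> resources = new HashMap<>();
              if (value == null || value.trim().isEmpty()) {
                return resources;
              }
              for (String entry : value.split(",")) {
                String[] parts = entry.trim().split(":");
                resources.put(parts[0], Float.parseFloat(parts[1]));
              }
              return resources;
            }
          }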

          Radim Kolar added a comment -

          Resources will be detectable by a plugin. A plugin will work in 2 modes (node-wide and cluster-wide) and report the amount of the resource currently in use and the maximum available. If a resource has no plugin assigned, it will use simple allocation - just hardcode into the Hadoop config how much of that resource you have.
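
          A sketch of what such a plugin contract might look like; the names are hypothetical, not an existing YARN interface:

          public interface ResourcePlugin {
            enum Scope { NODE, CLUSTER } // the two modes described above

            Scope getScope();
            String getResourceName();
            float getMaxAvailable();   // maximum this node/cluster offers
            float getCurrentlyInUse(); // amount currently consumed
          }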

          Robert Joseph Evans added a comment -

          @Radim,

          I have been trying to think of any resource or resource request that does not really fit into this model. There are a few that I have come up with. I don't know if they should be supported or not. I am including them here mostly for discussion to see what others think.

          1. optional resources. For example, I have a mapper that could really be sped up by having access to a GPU card, but it can run without one. So if there is one free I would like to use it, but if it isn't available I am OK without it. This opens up a bit of a can of worms: how do I indicate in the request how much effort should go into giving me access to that resource before giving up?
          2. a subclass of this might be flexible resources. I would love to have 10GB of RAM, it would really make my process run fast with all of the caching I would like to do, but I can get by with as little as 2GB (a possible request shape for this and for point 1 is sketched after this list).
          3. data locality/network accessible resources. I want to be near a given resource, but I don't necessarily need to be on the same box as the resource. This could be used when reading from/writing to a filer, a DB server, or even a different, yet-to-be-scheduled container. This is a bit tricky because we already try to do this, but in a different way: with each container we request a node/rack that we would like to run on. The reality is that, at least for map reduce, we are requesting an HDFS block that we would like to run near, but there are too many blocks to effectively put them into a resource model. Instead it was generalized, and would probably work for most things, except for the case where I want to be near a yet-to-be-scheduled container. Even that would work, so long as we only launch containers one at a time.
          4. global resources. Things like bandwidth between datacenters for distcp copies. Perhaps these could be modeled more as cluster-wide resources, but the APIs would have to be set up assuming that the free/max resources can change without any input from the RM, meaning the scheduler may need to deal with asking for those resources and finding that they are no longer available.
          5. default resources. For backwards compatibility with existing requests we may want some resources, like CPU, to have a default value if not explicitly given.
          6. tiered or dependent resources. With network, for example, if I am doing a distcp across colos, I am not only using up network bandwidth on the given box that I am on, I am also using up bandwidth on the rack switch and the bandwidth going between the colos. I am not a network engineer, so if I am completely wrong with the example, hopefully the general concept still holds. Do we want to somehow be able to tie different resources together, so that a request for one thing (intra-colo bandwidth) also implies a request for other resources?
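
          One hedged possibility for points 1 and 2, purely as a discussion aid; all names here are hypothetical:

          // Hypothetical shape for an optional/flexible resource request:
          // minAmount is a hard requirement, desiredAmount is a soft target
          // the scheduler may trim, and optional requests may be dropped.
          public class FlexibleResourceRequest {
            public final String resourceName;
            public final float minAmount;      // e.g. 2 GB of RAM
            public final float desiredAmount;  // e.g. 10 GB of RAM
            public final boolean optional;     // run anyway if unavailable

            public FlexibleResourceRequest(String resourceName, float minAmount,
                                           float desiredAmount, boolean optional) {
              this.resourceName = resourceName;
              this.minAmount = minAmount;
              this.desiredAmount = desiredAmount;
              this.optional = optional;
            }
          }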

          I also have a few questions.

          How are the plug-ins intended to adapt to different operating environments? Determining the amount of free CPU on Windows is very different from doing it on Linux. Would we have different plug-ins for different OSes? Would it be up to the end user to configure it all appropriately, or would we try to auto-detect the environment and adjust automatically?

          Do we want to handle enforcement of the resource limits as well? We currently kill off containers that are using more memory than requested (with some leeway). What about other things like CPU or network: do we kill the container if it goes over, do we try to use the OS to restrict the amount used, do we warn the user that they went over, or do we just ignore it and hope everyone is a good citizen?

          Radim Kolar added a comment -

          I would leave optional resources, and other features like tiered resources and data locality vs. optional resources, for resource scheduling in hadoop-3. For hadoop-2, let's start with something simple.

          1. Handling different OSes - one plugin for each operating system. If a plugin runs on an unsupported OS, it will throw an exception in its @PostConstruct method.

          2. Resource restrictions can be supported if the resource plugin implements an optional function - query how much of that resource type is currently in use by a container.

          3. Good point about resources which can change without any input from the RM. Let's call them volatile resources - if a resource is marked as such, the RM will need to query it periodically.

          4. Network-accessible resources will be handled as global resources.
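
          A minimal sketch of points 1 and 2; the class and method names are hypothetical, while @PostConstruct is the real javax.annotation.PostConstruct:

          import javax.annotation.PostConstruct;

          public class LinuxGpuResourcePlugin {

            @PostConstruct
            public void init() {
              // Point 1: fail fast on an unsupported operating system.
              String os = System.getProperty("os.name").toLowerCase();
              if (!os.contains("linux")) {
                throw new IllegalStateException(
                    "LinuxGpuResourcePlugin requires Linux, found: " + os);
              }
            }

            // Point 2: optional query - how much of this resource the
            // given container currently uses (hypothetical signature).
            public float getUsageByContainer(String containerId) {
              return 0.0f; // a real implementation would query the GPU driver
            }
          }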

          Robert Joseph Evans added a comment -

          I agree with you that we should start off with something simple. I have no idea if we even want to implement everything that I had in the list. I personally think that we do not want to, but I also want us to fully think about the problem so we don't paint ourselves into a corner.

          Your initial ideas seem very reasonable to me. I think the big thing here is coordinating with what others are trying to do in this area already.

          Andrew Ferguson under MAPREDUCE-4351 is trying to make container monitoring pluggable. It looks like we are both trying to split up container monitoring and make it pluggable, so we need to work that out.

          Arun is also pushing forward with adding in CPU as a resource under MAPREDUCE-4334.

          Radim Kolar added a comment -

          This ticket has 2 main points.

          1. pluggable resources framework
          2. resource scheduling using pluggable resources

          If pluggable resources are implemented, pluggable container monitoring is not needed; you cannot do much other than kill an application that uses too many resources. Possible extensions could be done via some kind of pluggable policy for what to do if resource consumption is exceeded.

          Instead of hacking hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/Resource.java in MAPREDUCE-4327 for CPU cores, it is better to add general-purpose resource management.
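
          A sketch of what such a pluggable over-limit policy might look like; the names are hypothetical, not an existing YARN API:

          public interface ResourceLimitPolicy {

            enum Action { IGNORE, WARN, KILL }

            // Decide what to do when a container exceeds its limit
            // for a given resource.
            Action onLimitExceeded(String containerId, String resourceName,
                                   float requested, float used);
          }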

          Radim Kolar added a comment -

          Regarding optional resources like a GPU for computation: let the task specify a weight in its application resource requests.

          For example, a request like "gpu:1:.8" means requesting 1 unit of the gpu resource with weight 0.8 -> give a node an 80% bonus if the gpu resource is available.

          This could handle data locality as well, because for some compute tasks data locality is not needed: they do not move much data over the network. In that case the request would be "data:1:0.0".
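
          A minimal sketch of parsing this hypothetical name:amount:weight syntax, with the weight optional and defaulting to 1.0 for a hard requirement; the class name is illustrative:

          public final class ResourceRequestSpec {
            public final String name;
            public final float amount;
            public final float weight;

            private ResourceRequestSpec(String name, float amount, float weight) {
              this.name = name;
              this.amount = amount;
              this.weight = weight;
            }

            // Parse e.g. "gpu:1:.8" -> name "gpu", amount 1.0, weight 0.8,
            // or "gpu:1" -> weight 1.0 (hard requirement).
            public static ResourceRequestSpec parse(String spec) {
              String[] parts = spec.split(":");
              if (parts.length < 2 || parts.length > 3) {
                throw new IllegalArgumentException(
                    "expected name:amount[:weight]: " + spec);
              }
              float weight = parts.length == 3 ? Float.parseFloat(parts[2]) : 1.0f;
              return new ResourceRequestSpec(parts[0], Float.parseFloat(parts[1]), weight);
            }
          }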

          Radim Kolar added a comment -

          Replaced by YARN-2.


            People

            • Assignee: Unassigned
            • Reporter: Radim Kolar
            • Votes: 0
            • Watchers: 27
