Hadoop Common
  1. Hadoop Common
  2. HADOOP-3421

Requirements for a Resource Manager for Hadoop

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      This is a proposal to extend the scheduling functionality of Hadoop to allow sharing of large clusters without the use of HOD. We're suffering from performance issues with HOD and not finding it the right model for running jobs. We have concluded that a native Hadoop Resource Manager would be more useful to many people if it supported the features we need for sharing clusters across large groups and organizations.

      Below are the key requirements for a Resource Manager for Hadoop. First, some terminology used in this writeup:

      • RM: Resource Manager. What we're building.
      • MR: Map Reduce.
      • A job is an MR job for now, but can be any request. Jobs are submitted by users to the Grid. MR jobs are made up of units of computation called tasks.
      • A grid has a variety of resources of different capacities that are allocated to tasks. For the the early version of the grid, the only resource considered is a Map or Reduce slot, which can execute a task. Each slot can run one or more tasks. Later versions may look at resources such as local temporary storage or CPUs.
      • V1: version 1. Some features are simplified for V1.

      Orgs, queues, users, jobs

      Organizations (Orgs) are distinct entities for administration, configuration, billing and reporting purposes. Users belong to Orgs. Orgs have queues of jobs, where a queue represents a collection of jobs that share some scheduling criteria.

      • 1.1. For V1, each queue will belong to one Org and each Org will have one queue.
      • 1.2. Jobs are submitted to queues. A single job can be submitted to only one queue. It follows that a job will have a user and an Org associated with it.
      • 1.3. A user can belong to multiple Orgs and can potentially submit jobs to multiple queues.
      • 1.4. Orgs are guaranteed a fraction of the capacity of the grid (their 'guaranteed capacity') in the sense that a certain capacity of resources will be at their disposal. All jobs submitted to the queues of an Org will have access to the capacity guaranteed to the Org.
        • Note: it is expected that the sum of the guaranteed capacity of each Org should equal the resources in the Grid. If the sum is lower, some resources will not be used. If the sum is higher, the RM cannot maintain guarantees for all Orgs.
      • 1.5. At any given time, free resources can be allocated to any Org beyond their guaranteed capacity. For example this may be in the proportion of guaranteed capacities of various Orgs or some other way. However, these excess allocated resources can be reclaimed and made available to another Org in order to meet its capacity guarantee.
      • 1.6. N minutes after an org reclaims resources, it should have all its reserved capacity available. Put another way, the system will guarantee that excess resources taken from an Org will be restored to it within N minutes of its need for them.
      • 1.7. Queues have access control. Queues can specify which users are (not) allowed to submit jobs to it. A user's job submission will be rejected if the user does not have access rights to the queue.

      Job capacity

      • 2.1. Users will just submit jobs to the Grid. They do not need to specify the capacity required for their jobs (i.e. how many parallel tasks the job needs). [Most MR jobs are elastic and do not require a fixed number of parallel tasks to run - they can run with as little or as much task parallelism as they can get. This amount of task parallelism is usually limited by the number of mappers required (which is computed by the system and not by the user) or the amount of free resources available in the grid. In most cases, the user wants to just submit a job and let the system take care of utilizing as many or as little resources as it can.]

      Priorities

      • 3.1. Jobs can optionally have priorities associated with them. For V1, we support the same set of priorities available to MR jobs today.
      • 3.2. Queues can optionally support priorities for jobs. By default, a queue does not support priorities, in which case it will ignore (with a warning) any priority levels specified by jobs submitted to it. If a queue does support priorities, it will have a default priority associated with it, which is assigned to jobs that don't have priorities. Reqs 3.1 and 3.2 together mean this: if a queue supports priorities, then a job is assigned the default priority if it doesn't have one specified, else the job's specified priority is used. If a queue does not support priorities, then it ignores priorities specified for jobs.
      • 3.3. Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority. However, once a job is running, it will not be preempted (i.e., stopped and restarted) for a higher priority job. What this also means is that comparison of priorities makes sense within queues, and not across them.

      Fairness/limits

      • 4.1. In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them. This user limit can vary between a minimum and maximum value. For V1, all users have the same limit, whose maximum value is dictated by the number of users who have submitted jobs, and whose minimum value is a pre-configured value UL-MIN. For example, suppose UL-MIN is 25. If two users have submitted jobs to a queue, no single user can use more than 50% of the queue resources. If a third user submits a job, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources.
        • Limits apply to newer scheduling, i.e., running jobs or tasks will not be preempted.
        • The value of UL-MIN can be set differently per Org.

      Job queue interaction

      • 5.1. Interaction with the Job queue should be through a command line interface and a web-based GUI.
      • 5.2. All queues are visible to all users. The Web UI will provide a single-page view of all queues.
      • 5.3. Users should be able to delete their queued jobs at any time.
      • 5.4. Users should be able to see capacity statistics for various Orgs (what is the capacity allocated, how much is being used, etc.)
      • 5.5. Existing utilities, such as the hadoop job -list command, should be enhanced to show additional attributes that are relevant. For e.g. which queue is associated with the job.

      Accounting

      • 6.1. The RM must provide accounting information in a manner that can be easily consumed by external plug-ins or utilities to integrate with 3rd party accounting systems. The accounting information should comprise of the following information:
        • Username running the Hadoop job,
        • job id,
        • job name,
        • queue to which job was submitted and organization owning the queue,
        • number of resource units (for e.g. slots) used
        • number of maps / reduces,
        • timings - time of entry into the queue, start and end times of the job, perhaps total node hours,
        • status of the job - success, failed, killed, etc.
      • 6.2. To assist deployments which do not require accounting, it should be possible to turn off this feature.

      Availability

      • 7.1 Job state needs to be persisted (RM restarts should not cause jobs to die)

      Scalability

      • 8.1. Scale to 3k+ nodes
      • 8.2. Scale to handle 1k+ large submitted jobs, each with 100k+ tasks

      Configuration

      • 9.1. The system must provide a mechanism to create and delete organizations, and queues within the organizations. It must also provide a mechanism to configure various properties of these objects.
      • 9.2. Only users with administrative privileges can perform operations of managing and configuring these objects in the system.
      • 9.3. Configuration changes must be effective in the RM without requiring its restart. They must be effective in a reasonable amount of time since the modification is made.
      • 9.4. For most of the configurations, it appears that there can be values at various levels - Grid, organization, queue, user and job. For e.g. there can be a default value for the resource quota per user at a Grid level, which can be overridden at an org level, and so on. There must be an easy way to express these configurations in this hierarchical fashion. Also, values at a broader level can be overridden by values at a more narrow level.
      • 9.5. There must be appropriate default objects and default values for their configuration. This is to help deployments that do not need a complex scheduling setup.

      Logging Enhancements

      • 10.1. For purposes of debugging, the Hadoop web UI should provide a facility to see details of all jobs. While this is mostly supported today, any changes to meet other requirements, such as scalability, must not affect this feature. Also, it must be possible to view task logs from Job history UI (see HADOOP:2165)
      • 10.2. The system must log all relevant events about a job vis-a-vis scheduling. Particularly, changes in state of a job (queued -> scheduled -> completed | killed), and events which caused these changes must be logged.
      • 10.3. The system should be able to provide relevant, explanatory information about the status of job to give feedback to users. This could be a diagnostic string such as why the job is queued or why it failed. (For e.g. lack of sufficient resources - how many were asked, how many are available, exceeding user limits, etc). This information must be available to users, as well as in the logs for debugging purposes. It should also be possible to programmatically get this information.
      • 10.4. The host which submitted the job should be part of log messages. This would assist in debugging.

      Security Enhancements

      • 11.1. The RM should provide a mechanism for controlling who can submit jobs to which queue. This could be done using an ACL mechanism that consists of an ordered whitelist and blacklist of users. The order can determine which ACL would apply in case of conflicts.
      • 11.2. The system must provide a mechanism to list users who have administrative control. Only users in this list should be allowed to modify configuration related to the RM, like configuration, setting up objects, etc.
      • 11.3. The system should be able to schedule tasks running on behalf of multiple users concurrently on the same host in a secure manner. Specifically, this should not require any insecure configuration, such as requiring 0777 permissions on directories etc.
      • 11.4. The system must follow the security mechanisms being implemented for Hadoop (HADOOP:1701 and friends).

        Issue Links

          Activity

          Hide
          steve_l added a comment -

          This is very, ambitious. Scheduling/Resource Management is always one of the big contention spots in Grid standards groups, as the scheduler people want fully deterministic "when will this job finish" information, which the Halting Problem says isnt on the table. I have wasted many days in meetings on this topic.

          Some recommendations
          -keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem.

          -Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse.

          -long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only.

          Also 11.3 Security: you can only isolate work on the same CPU by running each job in a JVM sandbox with no networking or exec() capabilities. That may be something more broadly useful than just the RM.

          Show
          steve_l added a comment - This is very, ambitious. Scheduling/Resource Management is always one of the big contention spots in Grid standards groups, as the scheduler people want fully deterministic "when will this job finish" information, which the Halting Problem says isnt on the table. I have wasted many days in meetings on this topic. Some recommendations -keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem. -Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse. -long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only. Also 11.3 Security: you can only isolate work on the same CPU by running each job in a JVM sandbox with no networking or exec() capabilities. That may be something more broadly useful than just the RM.
          Hide
          Vivek Ratan added a comment -

          This is the first step in what may very well turn out to be a long, and perhaps somewhat ambitious, effort, but it is sorely needed. The performance penalty of HOD is damaging, and having users specify the number of nodes they want does not seem to be the right model for them to submit jobs. The requirements listed here are an attempt to get started and put us along the right path, and are nowhere close to being a complete set.

          >> the scheduler people want fully deterministic "when will this job finish" information, which the Halting Problem says isnt on the table
          It seems even tougher for MR jobs which are very elastic. Maybe it's good enough for users to get a sense of when their job will run (and if it's not running, why) and see progress. But it certainly is a challenging problem to figure out when a MR job will run. Perhaps statistical prediction might be a good start?

          >> -keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem.
          Agreed. Scheduling itself is a complex, fascinating area. As you point out, it's important to keep it separate and make it easy to experiment with various algorithms. That's an architectural requirement we fully intend to respect.

          >> Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse.
          Yes. I can easily see a resource management system where user/org resource consumption is tracked over time and we can have quotas or billing or priority changes based on these usage patterns. For now though, to keep things simple and to get started, I think that between guaranteed capacities for Orgs (and dynamic redistribution of free resources), user limits, and job priorities, we can significantly improve utilization of Grid resources while maintaining fairness.

          >> long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only.
          Tools, GUI, and transparency are going to be critical. It's vital that users get a sense of what's going on, get a sense that scheduling is fair, rational, and deterministic, and that they don't just sit around wondering when things will run. I agree that a REST-like Web interface to the job queues makes a lot of sense, especially given that jobs and queues lend themselves very well to be modelled as resources. In fact, it'll be awesome if someone can work out a good Web-based interface to allow users to interact with the queue. It doesn't have to be REST, but REST does feel like the right model here.

          Show
          Vivek Ratan added a comment - This is the first step in what may very well turn out to be a long, and perhaps somewhat ambitious, effort, but it is sorely needed. The performance penalty of HOD is damaging, and having users specify the number of nodes they want does not seem to be the right model for them to submit jobs. The requirements listed here are an attempt to get started and put us along the right path, and are nowhere close to being a complete set. >> the scheduler people want fully deterministic "when will this job finish" information, which the Halting Problem says isnt on the table It seems even tougher for MR jobs which are very elastic. Maybe it's good enough for users to get a sense of when their job will run (and if it's not running, why) and see progress. But it certainly is a challenging problem to figure out when a MR job will run. Perhaps statistical prediction might be a good start? >> -keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem. Agreed. Scheduling itself is a complex, fascinating area. As you point out, it's important to keep it separate and make it easy to experiment with various algorithms. That's an architectural requirement we fully intend to respect. >> Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse. Yes. I can easily see a resource management system where user/org resource consumption is tracked over time and we can have quotas or billing or priority changes based on these usage patterns. For now though, to keep things simple and to get started, I think that between guaranteed capacities for Orgs (and dynamic redistribution of free resources), user limits, and job priorities, we can significantly improve utilization of Grid resources while maintaining fairness. >> long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only. Tools, GUI, and transparency are going to be critical. It's vital that users get a sense of what's going on, get a sense that scheduling is fair, rational, and deterministic, and that they don't just sit around wondering when things will run. I agree that a REST-like Web interface to the job queues makes a lot of sense, especially given that jobs and queues lend themselves very well to be modelled as resources. In fact, it'll be awesome if someone can work out a good Web-based interface to allow users to interact with the queue. It doesn't have to be REST, but REST does feel like the right model here.
          Hide
          Doug Cutting added a comment -

          Building a generic job manager, useful beyond mapreduce, without having uses beyond mapreduce seems like a doomed approach. The goal should either be narrowed to providing a better map-reduce specific job manager (i.e., an enhanced jobtracker), or broadened to be generic cluster-management tools that can be used by, e.g., hdfs, mapreduce, hbase, etc. In this latter case, an HDFS deployment might be a long-running high-priority job, where nodes would only be added and removed manually by an administrator. A mapreduce session might be a relatively short-running job whose nodes are added and removed dynamically as the job runs. For example, nodes might be removed both by the job itself as reduces complete, and by the cluster, if higher-priority jobs are launched. An hbase job would probably look more like an HDFS job.

          Show
          Doug Cutting added a comment - Building a generic job manager, useful beyond mapreduce, without having uses beyond mapreduce seems like a doomed approach. The goal should either be narrowed to providing a better map-reduce specific job manager (i.e., an enhanced jobtracker), or broadened to be generic cluster-management tools that can be used by, e.g., hdfs, mapreduce, hbase, etc. In this latter case, an HDFS deployment might be a long-running high-priority job, where nodes would only be added and removed manually by an administrator. A mapreduce session might be a relatively short-running job whose nodes are added and removed dynamically as the job runs. For example, nodes might be removed both by the job itself as reduces complete, and by the cluster, if higher-priority jobs are launched. An hbase job would probably look more like an HDFS job.
          Hide
          Vivek Ratan added a comment -

          Doug, these are valid points. Agreed that building a generic Resource Manager now, without appropriate use cases, is not a good idea. A focus on the scheduling needs for MapReduce, especially in the early versions, is vital. For V1, most changes would likely be in enhancing the JT. However, we're also looking to get our abstractions and architectural boundaries right, so that in the future, the system can potentially support non-MR jobs, while continuing to deal with MR jobs effectively. I believe this can be done. Like you said, in the long term, an MR job is given resources, or resources are taken away from it, dynamically. That's exactly what would be desirable - a fairly elastic resource management system that can deal with dynamic and frequent resource (de)allocations for a job if needed, support resource constraints such as data/rack locality, and maximize utilization of resources. There'll clearly be many steps along the way, but for now, it seems to make sense to focus on handling the scheduling needs of MR jobs.

          A lot of the scheduling code already works well in the JT, and we've been achieving decent data locality, so it'd be silly to throw that away. Adding support for queues, orgs, and quotas to this code makes sense to me.

          Show
          Vivek Ratan added a comment - Doug, these are valid points. Agreed that building a generic Resource Manager now, without appropriate use cases, is not a good idea. A focus on the scheduling needs for MapReduce, especially in the early versions, is vital. For V1, most changes would likely be in enhancing the JT. However, we're also looking to get our abstractions and architectural boundaries right, so that in the future, the system can potentially support non-MR jobs, while continuing to deal with MR jobs effectively. I believe this can be done. Like you said, in the long term, an MR job is given resources, or resources are taken away from it, dynamically. That's exactly what would be desirable - a fairly elastic resource management system that can deal with dynamic and frequent resource (de)allocations for a job if needed, support resource constraints such as data/rack locality, and maximize utilization of resources. There'll clearly be many steps along the way, but for now, it seems to make sense to focus on handling the scheduling needs of MR jobs. A lot of the scheduling code already works well in the JT, and we've been achieving decent data locality, so it'd be silly to throw that away. Adding support for queues, orgs, and quotas to this code makes sense to me.
          Hide
          Doug Cutting added a comment -

          Vivek, have you looked at HADOOP-3412? It looks like a step towards the goals here.

          Show
          Doug Cutting added a comment - Vivek, have you looked at HADOOP-3412 ? It looks like a step towards the goals here.
          Hide
          Brice Arnould added a comment - - edited

          I would love to help you getting ride of HOD !

          I think that the scheduling part will be easy to implement, but it raises other concerns that we might need to discuss. When they'll be solved, the rest could take less than a week.
          Those things are :

          • Assigning organisation to users, and authenticating users.
            Since we will need support in HDFS for authentication, a way to do that could be to delegate the authentification to the FS. A user's JobClient would confirm the fact he made a request by writing a cookie in a file owned by that user before submiting the job.
            Organisations would be described by files owned by the organisation and containing the list of their users.
            A little class "FsGateKeeper" would wrap around that mecanism in a convenient way.
            The problem is that I'm really unaware of what's being done on the HDFS side. I heard about it on the mailing list but that's all.
          • Insulating tasks from each other.
            If we want to allow different tasks from different organisations and to do resource allocation enforcement, we might want to ensure that they have only the needed privileges. It is a more complex issue.
            A first, almost good enough, step could be to write an basic "Insulator" that would use a provision of local accounts (users hadoopClient0 to N), and run each tasks with that users credential. On Linux, CPUsets could be used to reliably clean the working space after the end of a task.
            A more powerful solution would be build a sandboxing daemon around seccomp(). That kind of solution would be light and could be useful to a lot of other projects but would take a lot of time (if someone want more information about that I think that I have a file about this idea somewhere :-P).
          • Balancing the different kind of resources.
            I am working on that with Jaideep. I hope that we will have a first concrete proposition and a prototype by Tuesday.

          If you agree with that vision, I'll create issues for each of those steps.
          If we work together we might be able to do it quickly ^^ (I think it should be less than a month, but I tend to be very optimistic xD).

          PS: I have problems with conditionals in English, and I'm unsure of using "might" "should" and "could" in the right way. Please forbid me.

          Show
          Brice Arnould added a comment - - edited I would love to help you getting ride of HOD ! I think that the scheduling part will be easy to implement, but it raises other concerns that we might need to discuss. When they'll be solved, the rest could take less than a week. Those things are : Assigning organisation to users, and authenticating users. Since we will need support in HDFS for authentication, a way to do that could be to delegate the authentification to the FS. A user's JobClient would confirm the fact he made a request by writing a cookie in a file owned by that user before submiting the job. Organisations would be described by files owned by the organisation and containing the list of their users. A little class "FsGateKeeper" would wrap around that mecanism in a convenient way. The problem is that I'm really unaware of what's being done on the HDFS side. I heard about it on the mailing list but that's all. Insulating tasks from each other. If we want to allow different tasks from different organisations and to do resource allocation enforcement, we might want to ensure that they have only the needed privileges. It is a more complex issue. A first, almost good enough, step could be to write an basic "Insulator" that would use a provision of local accounts (users hadoopClient0 to N), and run each tasks with that users credential. On Linux, CPUsets could be used to reliably clean the working space after the end of a task. A more powerful solution would be build a sandboxing daemon around seccomp(). That kind of solution would be light and could be useful to a lot of other projects but would take a lot of time (if someone want more information about that I think that I have a file about this idea somewhere :-P). Balancing the different kind of resources. I am working on that with Jaideep. I hope that we will have a first concrete proposition and a prototype by Tuesday. If you agree with that vision, I'll create issues for each of those steps. If we work together we might be able to do it quickly ^^ (I think it should be less than a month, but I tend to be very optimistic xD). PS: I have problems with conditionals in English, and I'm unsure of using "might" "should" and "could" in the right way. Please forbid me.
          Hide
          Vivek Ratan added a comment -

          I have opened a Jira (HADOOP-3444) to track the implementation of these requirements.

          Show
          Vivek Ratan added a comment - I have opened a Jira ( HADOOP-3444 ) to track the implementation of these requirements.
          Hide
          Vivek Ratan added a comment -

          Brice, good points on the security and access control. We should open a separate Jira (which can be tracked under HADOOP-3444) to deal with these. Also, we should link your work in HADOOP-3412 to a Jira that looks at the core Scheduling enhancements for the JT.

          Show
          Vivek Ratan added a comment - Brice, good points on the security and access control. We should open a separate Jira (which can be tracked under HADOOP-3444 ) to deal with these. Also, we should link your work in HADOOP-3412 to a Jira that looks at the core Scheduling enhancements for the JT.
          Hide
          Benjamin Reed added a comment -

          Chris and I were discussing this issue this morning. Although the description is phrased as a list of requirements, in reality it is much more a design.

          Perhaps it would be good to agree on the requirements that are motivating this design. We may find that there are some requirements that aren't that important that make the design more complicated than it needs to be.

          Reading between the lines it sounds like you have the following requirements:

          These are the requirements that seem basic:

          • Guarantee a certain level of service to users - minimum number of cpus available and a bound on the time needed to acquire the cpus.
          • Keep grid utilization high - enable users to use more than their guaranteed level in a fair manner if extra resources are available .
          • Some jobs are more important than others, so should have a priority.
          • The resource manager figures out the best number of machines to allocate to a job.
          • Basic accounting for the resources used by a user

          These seem to be motivating your design, but I can't understand why they are needed. (Perhaps I'm misreading the implied requirement.)

          • Even if resources are available and fairness is met still limit the number of resources used by a job (4.1)
          • Abstractions of organizations and queues are needed for ... (Is this an ops requirement? or a client usability requirement?)

          There are probably other requirements we are missing that you have in mind. (Perhaps they would clear up the last two.) It would be great to get them documented.

          Show
          Benjamin Reed added a comment - Chris and I were discussing this issue this morning. Although the description is phrased as a list of requirements, in reality it is much more a design. Perhaps it would be good to agree on the requirements that are motivating this design. We may find that there are some requirements that aren't that important that make the design more complicated than it needs to be. Reading between the lines it sounds like you have the following requirements: These are the requirements that seem basic: Guarantee a certain level of service to users - minimum number of cpus available and a bound on the time needed to acquire the cpus. Keep grid utilization high - enable users to use more than their guaranteed level in a fair manner if extra resources are available . Some jobs are more important than others, so should have a priority. The resource manager figures out the best number of machines to allocate to a job. Basic accounting for the resources used by a user These seem to be motivating your design, but I can't understand why they are needed. (Perhaps I'm misreading the implied requirement.) Even if resources are available and fairness is met still limit the number of resources used by a job (4.1) Abstractions of organizations and queues are needed for ... (Is this an ops requirement? or a client usability requirement?) There are probably other requirements we are missing that you have in mind. (Perhaps they would clear up the last two.) It would be great to get them documented.
          Hide
          Doug Cutting added a comment -

          I wish folks could resist providing huge descriptions for issues. The description field should contain a short description of the problem to be solved. Elaborations of the problem and proposed solutions should be developed as part of the discussion in the comments. The description gets appended to every message about the issue. I'll add something to the wiki about this...

          Show
          Doug Cutting added a comment - I wish folks could resist providing huge descriptions for issues. The description field should contain a short description of the problem to be solved. Elaborations of the problem and proposed solutions should be developed as part of the discussion in the comments. The description gets appended to every message about the issue. I'll add something to the wiki about this...
          Hide
          Vivek Ratan added a comment -

          Sorry Doug. Wasn't aware. I'll keep it in mind next time.

          Show
          Vivek Ratan added a comment - Sorry Doug. Wasn't aware. I'll keep it in mind next time.
          Hide
          Vivek Ratan added a comment -

          >> Although the description is phrased as a list of requirements, in reality it is much more a design.

          IMO, these really are requirements. It's not very useful to say, for example, that a requirement is that a system be highly available. Yes, there are different ways to make it highly available, but it seems perfectly fine to provide some more detail if that's how you want the system to behave as a black box. Otherwise the requirements don't add a lot of value - they end up being fairly obvious. What's more important, again IMO, is that requirement be phrased in terms of usage, i.e., they should specify how external systems (users, etc) interact with the system. And it's important to specify in detail how user interaction will be affected. By explicitly mentioning 'guaranteed capacities' for Orgs and how excess capacity is redistributed and reclaimed, we're letting users know that that is exactly how the system will behave. By just saying the the utilization should be high, you open the door for the architecture/design to influence basic user interaction, which I don't think is right. There have also been some pains taken to make sure we don't talk about Hadoop components specifically, or that the requirements don't influence the design.

          >> Even if resources are available and fairness is met still limit the number of resources used by a job (4.1)

          Not sure I fully understand your concern. If resources are available, limits will not be enforced. The end of the first sentence for 4.1 says "if there is competition for them". User limits are applied only if there are more tasks than queue resources. Maybe that's not fully clear in the writeup, but that's the intention.

          >> Abstractions of organizations and queues are needed for ... (Is this an ops requirement? or a client usability requirement?)

          It's a bit of both. Orgs let us reason about policies, quotas, access control, accounting etc. Queues give you granularity within Orgs to support different scheduling policies. These abstractions affect client usability as users need to knwo which queue to submit jobs to, and hence need to be aware and in agreement with the queue's (and hence the Org's) policies.

          Show
          Vivek Ratan added a comment - >> Although the description is phrased as a list of requirements, in reality it is much more a design. IMO, these really are requirements. It's not very useful to say, for example, that a requirement is that a system be highly available. Yes, there are different ways to make it highly available, but it seems perfectly fine to provide some more detail if that's how you want the system to behave as a black box. Otherwise the requirements don't add a lot of value - they end up being fairly obvious. What's more important, again IMO, is that requirement be phrased in terms of usage, i.e., they should specify how external systems (users, etc) interact with the system. And it's important to specify in detail how user interaction will be affected. By explicitly mentioning 'guaranteed capacities' for Orgs and how excess capacity is redistributed and reclaimed, we're letting users know that that is exactly how the system will behave. By just saying the the utilization should be high, you open the door for the architecture/design to influence basic user interaction, which I don't think is right. There have also been some pains taken to make sure we don't talk about Hadoop components specifically, or that the requirements don't influence the design. >> Even if resources are available and fairness is met still limit the number of resources used by a job (4.1) Not sure I fully understand your concern. If resources are available, limits will not be enforced. The end of the first sentence for 4.1 says "if there is competition for them". User limits are applied only if there are more tasks than queue resources. Maybe that's not fully clear in the writeup, but that's the intention. >> Abstractions of organizations and queues are needed for ... (Is this an ops requirement? or a client usability requirement?) It's a bit of both. Orgs let us reason about policies, quotas, access control, accounting etc. Queues give you granularity within Orgs to support different scheduling policies. These abstractions affect client usability as users need to knwo which queue to submit jobs to, and hence need to be aware and in agreement with the queue's (and hence the Org's) policies.
          Hide
          Benjamin Reed added a comment -

          Why are you using orgs to reason about access control? Queues have ACLs. It seems strange to have two different access checks. It's also not clear why accounting at the queue level isn't enough. The policy and accounting also seem to be done at the queue. You may want orgs for accounting rollup, but you can do that outside the resource manager. Orgs are also often hierarchical, but it seems overkill to put it in the RM when you can easily rollup the accounting outside of the RM.

          I'm not saying it's wrong to have orgs, but I don't see the requirement motivating the design. (You've stated it as a requirements, so perhaps I should say I don't see the motivation for the requirement.)

          I have similar thoughts about priorities. It seems reasonable to have priorities associated with a queue, so why do you also need it with a job? Some designs would define job priority as the queue the job is in.

          I realize you see some things as obvious, but these obvious requirements aren't clear from your design. It is interesting you bring up "highly available". It appears that HA is not a requirement. Correct? Your only availability requirement, which is in the parenthesis, is that a server can restart. If a server fails, the system can hang until it is restarted. Correct? The requirement under Availability is more of a persistence guarantee.

          Show
          Benjamin Reed added a comment - Why are you using orgs to reason about access control? Queues have ACLs. It seems strange to have two different access checks. It's also not clear why accounting at the queue level isn't enough. The policy and accounting also seem to be done at the queue. You may want orgs for accounting rollup, but you can do that outside the resource manager. Orgs are also often hierarchical, but it seems overkill to put it in the RM when you can easily rollup the accounting outside of the RM. I'm not saying it's wrong to have orgs, but I don't see the requirement motivating the design. (You've stated it as a requirements, so perhaps I should say I don't see the motivation for the requirement.) I have similar thoughts about priorities. It seems reasonable to have priorities associated with a queue, so why do you also need it with a job? Some designs would define job priority as the queue the job is in. I realize you see some things as obvious, but these obvious requirements aren't clear from your design. It is interesting you bring up "highly available". It appears that HA is not a requirement. Correct? Your only availability requirement, which is in the parenthesis, is that a server can restart. If a server fails, the system can hang until it is restarted. Correct? The requirement under Availability is more of a persistence guarantee.
          Hide
          Vivek Ratan added a comment -

          Ben, these are valid concerns you raise.

          The relationship between Orgs and queues, at least in the way we've been discussing it between a few of us, is that Orgs 'control' Grid resources ('guaranteed capacity' is one way to do that). And users belong to Orgs. Queues are a way for an an Org to divide its resources. We see an Org eventually having multiple queues. There may be a queue for very high priority jobs, which may get first dibs on an Org's resources, for example.You can imagine other kinds of queues as well, based on other scheduling criteria. Access control happens at both levels. Users can only submit jobs to the queues in their Orgs, and can submit to more than one queue. Individual queues may also control who is allowed to use them. Basically, Orgs and queues provide a hierarchy that we hope helps with configuration (you can have some configuration of policies at Org levels that applies to all queues, and some of these policies may be overridden at the queue levels). There are other ways to do this. You could just have queues in the system, but configuration and management would probably be harder when compared to a more hierarchical structure.

          It's fair to argue whether we need any hierarchy at all, and if we do, why just restrict it to Orgs and queues. Why can't queues have sub-queues, for example? My personal view is that these things will become more clear as we start using this system. It intuitively feels right to have Orgs and queues right now, and there are a number of use cases to justify it, some of which I've mentioned in this discussion. But, partly in deference to the fact that it may be too premature to work out the exact details between Orgs, queues, sub-queues, and what not, we'd like to build V1 with one queue per Org, basically equating the two. It's quite possible, and as you point out somewhat, that Orgs manifest themselves more outside the RM (in fact, in the design for V1, we really just deal with queues). I expect that over the course of the next many weeks/months, we'll be able to define a more concrete relationship between all these entities. IMO, it's premature to spend too much time on that for V1.

          Regarding priorities: queues only decide if they look at job priorities. Jobs always set their own priorities. If a queue respects priorities, it will order jobs based on job priorities. If it does not, it will order jobs based on when they were submitted.

          Regarding HA: there are lots of ways to build HA into the system. What we're suggesting for V1 is for the ability to restart the RM and continue from where it left off (i.e., users do not have to resubmit jobs). This adds a level of HA to the existing HOD system today. So rather than say 'HA is a requirement' (which, as I've mentioned earlier, does not buy us anything), we're specifying exactly what is a requirement in V1 to improve the system's HA. as a user, you're guaranteed that your jobs will persist after you submit them. That is the HA-related requirement this system will enforce. Yes, this is technically a guarantee of persistence, but that in turn gets you some level of availability. There is also some level of HA built into the JTs and TTs today, which will continue. if you're concerned whether Req 7.1 technically falls under 'HA', that's a different discussion. I think it does, but I also think it doesn't matter much whether you group it under 'Availability' or 'Persistence' or something else. But, you might differ.

          Show
          Vivek Ratan added a comment - Ben, these are valid concerns you raise. The relationship between Orgs and queues, at least in the way we've been discussing it between a few of us, is that Orgs 'control' Grid resources ('guaranteed capacity' is one way to do that). And users belong to Orgs. Queues are a way for an an Org to divide its resources. We see an Org eventually having multiple queues. There may be a queue for very high priority jobs, which may get first dibs on an Org's resources, for example.You can imagine other kinds of queues as well, based on other scheduling criteria. Access control happens at both levels. Users can only submit jobs to the queues in their Orgs, and can submit to more than one queue. Individual queues may also control who is allowed to use them. Basically, Orgs and queues provide a hierarchy that we hope helps with configuration (you can have some configuration of policies at Org levels that applies to all queues, and some of these policies may be overridden at the queue levels). There are other ways to do this. You could just have queues in the system, but configuration and management would probably be harder when compared to a more hierarchical structure. It's fair to argue whether we need any hierarchy at all, and if we do, why just restrict it to Orgs and queues. Why can't queues have sub-queues, for example? My personal view is that these things will become more clear as we start using this system. It intuitively feels right to have Orgs and queues right now, and there are a number of use cases to justify it, some of which I've mentioned in this discussion. But, partly in deference to the fact that it may be too premature to work out the exact details between Orgs, queues, sub-queues, and what not, we'd like to build V1 with one queue per Org, basically equating the two. It's quite possible, and as you point out somewhat, that Orgs manifest themselves more outside the RM (in fact, in the design for V1, we really just deal with queues). I expect that over the course of the next many weeks/months, we'll be able to define a more concrete relationship between all these entities. IMO, it's premature to spend too much time on that for V1. Regarding priorities: queues only decide if they look at job priorities. Jobs always set their own priorities. If a queue respects priorities, it will order jobs based on job priorities. If it does not, it will order jobs based on when they were submitted. Regarding HA: there are lots of ways to build HA into the system. What we're suggesting for V1 is for the ability to restart the RM and continue from where it left off (i.e., users do not have to resubmit jobs). This adds a level of HA to the existing HOD system today. So rather than say 'HA is a requirement' (which, as I've mentioned earlier, does not buy us anything), we're specifying exactly what is a requirement in V1 to improve the system's HA. as a user, you're guaranteed that your jobs will persist after you submit them. That is the HA-related requirement this system will enforce. Yes, this is technically a guarantee of persistence, but that in turn gets you some level of availability. There is also some level of HA built into the JTs and TTs today, which will continue. if you're concerned whether Req 7.1 technically falls under 'HA', that's a different discussion. I think it does, but I also think it doesn't matter much whether you group it under 'Availability' or 'Persistence' or something else. But, you might differ.
          Hide
          Benjamin Reed added a comment -

          I think I may have taken my main point off track with examples. To give good design feedback we need the base requirements. It looks like you have already moved beyond that stage, so the point is moot.

          Show
          Benjamin Reed added a comment - I think I may have taken my main point off track with examples. To give good design feedback we need the base requirements. It looks like you have already moved beyond that stage, so the point is moot.
          Hide
          Vivek Ratan added a comment -

          Regarding Reqs 3.1 and 3.2: In Hadoop today, if a user does not assign a priority to a job when the job is submitted, the system assigns a default job priority. JobConf::getJobPriority() returns JobPriority.NORMAL if no priority has been set in the job's config file. This means that once a job is read into Hadoop, it always has a priority, as it is implemented today. That seems reasonable enough to me. In order not to change things around, the two reqs should be modified as follows:

          3.1. Jobs have priorities associated with them (users can optionally assign a priority to a job, or else the system assigns a default priority). For V1, we support the same set of priorities available to MR jobs today.
          3.2. Queues can optionally support priorities for jobs. By default, a queue does not support priorities, in which case it will ignore (with a warning) any priority levels specified by jobs submitted to it. If a queue does support priorities, it will respect the priority set for the job.

          I realize that we're tweaking a requirement based on current implementation, but spirit of the requirement was that we continue letting users submit jobs as before, and the system's behavior does not change. I think that is captured better with the newer set of requirements.

          Show
          Vivek Ratan added a comment - Regarding Reqs 3.1 and 3.2: In Hadoop today, if a user does not assign a priority to a job when the job is submitted, the system assigns a default job priority. JobConf::getJobPriority() returns JobPriority.NORMAL if no priority has been set in the job's config file. This means that once a job is read into Hadoop, it always has a priority, as it is implemented today. That seems reasonable enough to me. In order not to change things around, the two reqs should be modified as follows: 3.1 . Jobs have priorities associated with them (users can optionally assign a priority to a job, or else the system assigns a default priority). For V1, we support the same set of priorities available to MR jobs today. 3.2 . Queues can optionally support priorities for jobs. By default, a queue does not support priorities, in which case it will ignore (with a warning) any priority levels specified by jobs submitted to it. If a queue does support priorities, it will respect the priority set for the job. I realize that we're tweaking a requirement based on current implementation, but spirit of the requirement was that we continue letting users submit jobs as before, and the system's behavior does not change. I think that is captured better with the newer set of requirements.
          Hide
          Hong Tang added a comment -

          Queues are just artifacts of the design of a scheduler. The requirements do not include how jobs/resources are prioritized across queues.

          IMO, it might be cleaner to have one scheduler for each org, and the interface of the scheduler consists of two API: "scheduleIn(Job, Priority)" and "Job scheduleOut()", where Priority is an opaque object specific to the particular scheduler used by org. Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler.

          The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with FIFO scheduler or strict priority scheduler and evolve in the long run.

          Show
          Hong Tang added a comment - Queues are just artifacts of the design of a scheduler. The requirements do not include how jobs/resources are prioritized across queues. IMO, it might be cleaner to have one scheduler for each org, and the interface of the scheduler consists of two API: "scheduleIn(Job, Priority)" and "Job scheduleOut()", where Priority is an opaque object specific to the particular scheduler used by org. Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler. The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with FIFO scheduler or strict priority scheduler and evolve in the long run.
          Hide
          Vivek Ratan added a comment -

          >> Queues are just artifacts of the design of a scheduler.
          Maybe I'm nitpicking here, and probably this is not very important, but queues are a requirement in this proposal, and queues, as mentioned in the requirements (which, hereforth, I will enclose in quotes) , are different from data-structure queues. A 'queue' is where users submit a job and is a user-facing feature. In the design, you may choose to implement 'queues' with whatever data structure you want. But there is something called a 'queue' which accepts user jobs, which supports access control, etc. Maybe calling it a job repository or some such thing helps? If there's confusion here, maybe it's to do with what people think should go in requirements and what should go in architecture/design?

          >> Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler.
          Again, there is a difference between a 'queue' in the requirements and queues in implementation. The way the RM is visualized, you configure a Hadoop installation to support one or more 'queues' (post-V1, you'd likely have one or more 'queues' grouped into an Org), and you configure a bunch of stuff for each 'queue': access control, capacity, etc (see HADOOP-3479 for details). You can certainly configure your system to have as many, or as few, 'queues' as you want, and I don't see them as directly linked to the Scheduler implementation. There are certain queue configuration values that may affect scheduling decisions: for example, you may want to give more weight to some 'queues', so they have greater access to free resources. But it seems like you're suggesting a much stronger connection between the number of queues and the scheduler implementation. If so, can you elaborate? Perhaps a use case would help?

          >> The requirements do not include how jobs/resources are prioritized across queues.
          Req 3.3 says that "comparison of priorities makes sense within queues, and not across them". what this means is that it doesn't make sense for jobs to be prioritized across queues. Since a user submits a job to a particular queue, he/she really only cares where the job ends up in that queue. Maybe what you're asking is how are free resources (Map/Reduce slots, in V1) distributed across queues. For example, if a TT has a free Map slot, which queue does the RM look at? The requirements don't address that (maybe they should). There are a few options for implementation, which are discussed in HADOOP-3445 (see point #2 under the section 'When a TT has a free Map slot'). If you have some other ideas in this respect, I'd love to hear them on that Jira.

          >> The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with FIFO scheduler or strict priority scheduler and evolve in the long run.
          Exactly. It's important to keep the Scheduler piece (the component that decides which free resource to be allocated to which job/task) separate enough. Some scheduling algos work well with a full view of the system. Some may work well by looking only at their own Org, as you suggest. For now, the Scheduler algorithm continues to be the same - use priorities and FIFO to sort jobs, then use data locality and some other stuff to decide which task to assign the TT. This seems to be working reasonably well (we've been getting good data locality hits) so far, but can evolve if needed.

          I hope I didn't go off on a tangent here.

          Show
          Vivek Ratan added a comment - >> Queues are just artifacts of the design of a scheduler. Maybe I'm nitpicking here, and probably this is not very important, but queues are a requirement in this proposal, and queues, as mentioned in the requirements (which, hereforth, I will enclose in quotes) , are different from data-structure queues. A 'queue' is where users submit a job and is a user-facing feature. In the design, you may choose to implement 'queues' with whatever data structure you want. But there is something called a 'queue' which accepts user jobs, which supports access control, etc. Maybe calling it a job repository or some such thing helps? If there's confusion here, maybe it's to do with what people think should go in requirements and what should go in architecture/design? >> Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler. Again, there is a difference between a 'queue' in the requirements and queues in implementation. The way the RM is visualized, you configure a Hadoop installation to support one or more 'queues' (post-V1, you'd likely have one or more 'queues' grouped into an Org), and you configure a bunch of stuff for each 'queue': access control, capacity, etc (see HADOOP-3479 for details). You can certainly configure your system to have as many, or as few, 'queues' as you want, and I don't see them as directly linked to the Scheduler implementation. There are certain queue configuration values that may affect scheduling decisions: for example, you may want to give more weight to some 'queues', so they have greater access to free resources. But it seems like you're suggesting a much stronger connection between the number of queues and the scheduler implementation. If so, can you elaborate? Perhaps a use case would help? >> The requirements do not include how jobs/resources are prioritized across queues. Req 3.3 says that "comparison of priorities makes sense within queues, and not across them". what this means is that it doesn't make sense for jobs to be prioritized across queues. Since a user submits a job to a particular queue, he/she really only cares where the job ends up in that queue. Maybe what you're asking is how are free resources (Map/Reduce slots, in V1) distributed across queues. For example, if a TT has a free Map slot, which queue does the RM look at? The requirements don't address that (maybe they should). There are a few options for implementation, which are discussed in HADOOP-3445 (see point #2 under the section 'When a TT has a free Map slot'). If you have some other ideas in this respect, I'd love to hear them on that Jira. >> The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with FIFO scheduler or strict priority scheduler and evolve in the long run. Exactly. It's important to keep the Scheduler piece (the component that decides which free resource to be allocated to which job/task) separate enough. Some scheduling algos work well with a full view of the system. Some may work well by looking only at their own Org, as you suggest. For now, the Scheduler algorithm continues to be the same - use priorities and FIFO to sort jobs, then use data locality and some other stuff to decide which task to assign the TT. This seems to be working reasonably well (we've been getting good data locality hits) so far, but can evolve if needed. I hope I didn't go off on a tangent here.
          Hide
          Hong Tang added a comment -

          I think I got your point. Yes, historically, SuperComputing centers do use the name "Queue" to describe "Job repositories". So maybe we should stick with the name "Queue" (and I will always use "Job queue" in the following for clarity). What I describe as "scheduler" would then be the underlying mechanism and policy that arranges jobs in the "Job Queue", and in the scheduler, the implementation may have data-structure queues.

          If we support multiple "Job Queues" for each org, the scheduler in each may be oblivious to each other and thus it could be harder to implement intelligent scheduling (due to unpredictable interferance).

          So the reason I see you want to have multiple "Job Queue" is that you can associate access and capacity control at the "Job Queue" level (instead of at the job level). What I am thinking is that we can provide only one "Job Queue" per org, but inside each "Job Queue", we can have job classes, and we associate resource allocation, prioritization of jobs in the same class, access/capacity control a the job class level. Admittedly, it is just a change of view. If we can clearly define how resources are allocated among different "Job Queues" and use a single scheduler to drive all Job queues in one Org, then we get mostly the same thing...

          Show
          Hong Tang added a comment - I think I got your point. Yes, historically, SuperComputing centers do use the name "Queue" to describe "Job repositories". So maybe we should stick with the name "Queue" (and I will always use "Job queue" in the following for clarity). What I describe as "scheduler" would then be the underlying mechanism and policy that arranges jobs in the "Job Queue", and in the scheduler, the implementation may have data-structure queues. If we support multiple "Job Queues" for each org, the scheduler in each may be oblivious to each other and thus it could be harder to implement intelligent scheduling (due to unpredictable interferance). So the reason I see you want to have multiple "Job Queue" is that you can associate access and capacity control at the "Job Queue" level (instead of at the job level). What I am thinking is that we can provide only one "Job Queue" per org, but inside each "Job Queue", we can have job classes, and we associate resource allocation, prioritization of jobs in the same class, access/capacity control a the job class level. Admittedly, it is just a change of view. If we can clearly define how resources are allocated among different "Job Queues" and use a single scheduler to drive all Job queues in one Org, then we get mostly the same thing...
          Hide
          Christian Kunz added a comment -

          1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority.
          2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back?
          3) Is 'N minutes' in 1.6 configurable per org?
          4) 8.1: shouldn't the RM be able to allow 20k+ nodes?

          Show
          Christian Kunz added a comment - 1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority. 2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back? 3) Is 'N minutes' in 1.6 configurable per org? 4) 8.1: shouldn't the RM be able to allow 20k+ nodes?
          Hide
          Vivek Ratan added a comment -

          >> 1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority.

          You're right - there are legitimate use cases where you want to preempt running jobs for others. But we need to be clear on the exact semantics for preemption, especially since we're doing task level scheduling. What Req 3.3 implies is that if a job with a higher priority comes in while a job with a lower priority is running (i.e., some of its tasks have run, or are running), then going forward, the runnable tasks of the higher priority job will be scheduled before the runnable tasks of the lower priority job. What we will not do is kill the running tasks of the lower priority job (except in the case of Req 1.6). This becomes an issue only if the tasks of the lower priority job are long running. Essentially, what we're trying to limit in V1 is the killing of running tasks. Otherwise, you do get preemption in the sense that tasks of the higher priority jobs run earlier, or to put it more generically, the higher priority job has higher/earlier access to a queue's resources than the lower priority job. The flip side here is that you will may end up with a large number of running jobs (which are jobs with at least one task running or having run). This can cause a strain in resources (running jobs use up temp disk space and have a higher memory footprint in today's JT).

          >> 2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back?

          Only to satisfy Req 1.6 for now. It's possible that we kill tasks in more situations, in later versions.

          >> 3) Is 'N minutes' in 1.6 configurable per org?

          Right now, the thinking is to keep things simple (just so we can get something out soon), which means that we're leaning towards global configuration values rather than per Org, in many cases. However, it'd be cool if folks can submit patches to add more finer-grained options. Clearly, having a per-Org value of N makes sense. See HADOOP-3479 for more details on the implementation effort for configuration.

          >> 4) 8.1: shouldn't the RM be able to allow 20k+ nodes?

          The 3K value is for V1. It ties in to the scale we can support with Hadoop in general. Clearly, as HDFS and MR, and Hadoop as a whole, start scaling to more and more nodes, the Resource Manager will have to keep pace. It doesn't make sense to actually implement V1 to handle 20K+ nodes when other Hadoop components do not scale that much. But yes, the RM architecture should be able to handle numbers like 10K or 20K. Maybe that req can be rephrased to say that the 3K number is for now, and we expect it to grow to 10 K or 20K soon?

          Show
          Vivek Ratan added a comment - >> 1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority. You're right - there are legitimate use cases where you want to preempt running jobs for others. But we need to be clear on the exact semantics for preemption, especially since we're doing task level scheduling. What Req 3.3 implies is that if a job with a higher priority comes in while a job with a lower priority is running (i.e., some of its tasks have run, or are running), then going forward, the runnable tasks of the higher priority job will be scheduled before the runnable tasks of the lower priority job. What we will not do is kill the running tasks of the lower priority job (except in the case of Req 1.6). This becomes an issue only if the tasks of the lower priority job are long running. Essentially, what we're trying to limit in V1 is the killing of running tasks. Otherwise, you do get preemption in the sense that tasks of the higher priority jobs run earlier, or to put it more generically, the higher priority job has higher/earlier access to a queue's resources than the lower priority job. The flip side here is that you will may end up with a large number of running jobs (which are jobs with at least one task running or having run). This can cause a strain in resources (running jobs use up temp disk space and have a higher memory footprint in today's JT). >> 2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back? Only to satisfy Req 1.6 for now. It's possible that we kill tasks in more situations, in later versions. >> 3) Is 'N minutes' in 1.6 configurable per org? Right now, the thinking is to keep things simple (just so we can get something out soon), which means that we're leaning towards global configuration values rather than per Org, in many cases. However, it'd be cool if folks can submit patches to add more finer-grained options. Clearly, having a per-Org value of N makes sense. See HADOOP-3479 for more details on the implementation effort for configuration. >> 4) 8.1: shouldn't the RM be able to allow 20k+ nodes? The 3K value is for V1. It ties in to the scale we can support with Hadoop in general. Clearly, as HDFS and MR, and Hadoop as a whole, start scaling to more and more nodes, the Resource Manager will have to keep pace. It doesn't make sense to actually implement V1 to handle 20K+ nodes when other Hadoop components do not scale that much. But yes, the RM architecture should be able to handle numbers like 10K or 20K. Maybe that req can be rephrased to say that the 3K number is for now, and we expect it to grow to 10 K or 20K soon?
          Hide
          Vivek Ratan added a comment -

          A new requirement is to be added to Version 1:

          4.2. Memory limits.

          • There is a cap, MAX_MEM, on the total memory used by all Hadoop tasks and their descendants on a node. This cap applies to virtual memory size, not to resident size. MAX_MEM is configurable for each node.
          • Each task, including its dependents, has a limit MAX_MEM_PER_TASK, which by default is equal to MAX-MEM/N, where N is the total number of tasks that the TT is configured for. A task that exceeds its memory cap will be killed.
          • Users can, however, optionally provide their own MAX_MEM_PER_TASK limit for each task in a job. The RM will take into account available memory on a TT as well as a task's specified limit when scheduling.
          Show
          Vivek Ratan added a comment - A new requirement is to be added to Version 1: 4.2. Memory limits. There is a cap, MAX_MEM, on the total memory used by all Hadoop tasks and their descendants on a node. This cap applies to virtual memory size, not to resident size. MAX_MEM is configurable for each node. Each task, including its dependents, has a limit MAX_MEM_PER_TASK, which by default is equal to MAX-MEM/N, where N is the total number of tasks that the TT is configured for. A task that exceeds its memory cap will be killed. Users can, however, optionally provide their own MAX_MEM_PER_TASK limit for each task in a job. The RM will take into account available memory on a TT as well as a task's specified limit when scheduling.
          Hide
          Harsh J added a comment -

          Resolving as dupe of MAPREDUCE-279. Although, this is much better doc-wise and serves as a good reference.

          Please reopen if I missed something that the other didn't provide (and was the goal here).

          Show
          Harsh J added a comment - Resolving as dupe of MAPREDUCE-279 . Although, this is much better doc-wise and serves as a good reference. Please reopen if I missed something that the other didn't provide (and was the goal here).

            People

            • Assignee:
              Unassigned
              Reporter:
              Vivek Ratan
            • Votes:
              1 Vote for this issue
              Watchers:
              27 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development