This is the first step in what may very well turn out to be a long, and perhaps somewhat ambitious, effort, but it is sorely needed. The performance penalty of HOD is damaging, and having users specify the number of nodes they want does not seem to be the right model for them to submit jobs. The requirements listed here are an attempt to get started and put us along the right path, and are nowhere close to being a complete set.
>> the scheduler people want fully deterministic "when will this job finish" information, which the Halting Problem says isnt on the table >> -keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem. >> Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse. >> long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only. Building a generic job manager, useful beyond mapreduce, without having uses beyond mapreduce seems like a doomed approach. The goal should either be narrowed to providing a better map-reduce specific job manager (i.e., an enhanced jobtracker), or broadened to be generic cluster-management tools that can be used by, e.g., hdfs, mapreduce, hbase, etc. In this latter case, an HDFS deployment might be a long-running high-priority job, where nodes would only be added and removed manually by an administrator. A mapreduce session might be a relatively short-running job whose nodes are added and removed dynamically as the job runs. For example, nodes might be removed both by the job itself as reduces complete, and by the cluster, if higher-priority jobs are launched. An hbase job would probably look more like an HDFS job.
Doug, these are valid points. Agreed that building a generic Resource Manager now, without appropriate use cases, is not a good idea. A focus on the scheduling needs for MapReduce, especially in the early versions, is vital. For V1, most changes would likely be in enhancing the JT. However, we're also looking to get our abstractions and architectural boundaries right, so that in the future, the system can potentially support non-MR jobs, while continuing to deal with MR jobs effectively. I believe this can be done. Like you said, in the long term, an MR job is given resources, or resources are taken away from it, dynamically. That's exactly what would be desirable - a fairly elastic resource management system that can deal with dynamic and frequent resource (de)allocations for a job if needed, support resource constraints such as data/rack locality, and maximize utilization of resources. There'll clearly be many steps along the way, but for now, it seems to make sense to focus on handling the scheduling needs of MR jobs.
A lot of the scheduling code already works well in the JT, and we've been achieving decent data locality, so it'd be silly to throw that away. Adding support for queues, orgs, and quotas to this code makes sense to me. Vivek, have you looked at
I would love to help you get rid of HOD!
I think that the scheduling part will be easy to implement, but it raises other concerns that we might need to discuss. Once they are solved, the rest could take less than a week.
If you agree with that vision, I'll create issues for each of those steps. PS: I have problems with conditionals in English, and I'm unsure of using "might" "should" and "could" in the right way. Please forgive me.
I have opened a Jira (HADOOP-3444) to track the implementation of these requirements.
Brice, good points on the security and access control. We should open a separate Jira (which can be tracked under HADOOP-3444) to deal with these. Also, we should link your work in
Chris and I were discussing this issue this morning. Although the description is phrased as a list of requirements, in reality it is much more a design.
Perhaps it would be good to agree on the requirements that are motivating this design. We may find that there are some requirements that aren't that important that make the design more complicated than it needs to be. Reading between the lines it sounds like you have the following requirements: These are the requirements that seem basic:
These seem to be motivating your design, but I can't understand why they are needed. (Perhaps I'm misreading the implied requirement.)
There are probably other requirements we are missing that you have in mind. (Perhaps they would clear up the last two.) It would be great to get them documented. I wish folks could resist providing huge descriptions for issues. The description field should contain a short description of the problem to be solved. Elaborations of the problem and proposed solutions should be developed as part of the discussion in the comments. The description gets appended to every message about the issue. I'll add something to the wiki about this...
Sorry Doug. Wasn't aware. I'll keep it in mind next time.
>> Although the description is phrased as a list of requirements, in reality it is much more a design.
IMO, these really are requirements. It's not very useful to say, for example, that a requirement is that a system be highly available. Yes, there are different ways to make it highly available, but it seems perfectly fine to provide some more detail if that's how you want the system to behave as a black box. Otherwise the requirements don't add a lot of value - they end up being fairly obvious. What's more important, again IMO, is that requirements be phrased in terms of usage, i.e., they should specify how external systems (users, etc) interact with the system. And it's important to specify in detail how user interaction will be affected. By explicitly mentioning 'guaranteed capacities' for Orgs and how excess capacity is redistributed and reclaimed, we're letting users know that that is exactly how the system will behave. By just saying that the utilization should be high, you open the door for the architecture/design to influence basic user interaction, which I don't think is right. There have also been some pains taken to make sure we don't talk about Hadoop components specifically, or that the requirements don't influence the design.
>> Even if resources are available and fairness is met still limit the number of resources used by a job (4.1)
Not sure I fully understand your concern. If resources are available, limits will not be enforced. The end of the first sentence for 4.1 says "if there is competition for them". User limits are applied only if there are more tasks than queue resources. Maybe that's not fully clear in the writeup, but that's the intention.
>> Abstractions of organizations and queues are needed for ... (Is this an ops requirement? or a client usability requirement?)
It's a bit of both. Orgs let us reason about policies, quotas, access control, accounting etc. Queues give you granularity within Orgs to support different scheduling policies.
These abstractions affect client usability as users need to know which queue to submit jobs to, and hence need to be aware of and in agreement with the queue's (and hence the Org's) policies.
Why are you using orgs to reason about access control? Queues have ACLs. It seems strange to have two different access checks. It's also not clear why accounting at the queue level isn't enough. The policy and accounting also seem to be done at the queue. You may want orgs for accounting rollup, but you can do that outside the resource manager. Orgs are also often hierarchical, but it seems overkill to put it in the RM when you can easily roll up the accounting outside of the RM.
I'm not saying it's wrong to have orgs, but I don't see the requirement motivating the design. (You've stated it as a requirement, so perhaps I should say I don't see the motivation for the requirement.) I have similar thoughts about priorities. It seems reasonable to have priorities associated with a queue, so why do you also need it with a job? Some designs would define job priority as the queue the job is in. I realize you see some things as obvious, but these obvious requirements aren't clear from your design. It is interesting you bring up "highly available". It appears that HA is not a requirement. Correct? Your only availability requirement, which is in parentheses, is that a server can restart. If a server fails, the system can hang until it is restarted. Correct? The requirement under Availability is more of a persistence guarantee.
Ben, these are valid concerns you raise.
The relationship between Orgs and queues, at least in the way we've been discussing it between a few of us, is that Orgs 'control' Grid resources ('guaranteed capacity' is one way to do that). And users belong to Orgs. Queues are a way for an Org to divide its resources. We see an Org eventually having multiple queues. There may be a queue for very high priority jobs, which may get first dibs on an Org's resources, for example. You can imagine other kinds of queues as well, based on other scheduling criteria. Access control happens at both levels. Users can only submit jobs to the queues in their Orgs, and can submit to more than one queue. Individual queues may also control who is allowed to use them. Basically, Orgs and queues provide a hierarchy that we hope helps with configuration (you can have some configuration of policies at Org levels that applies to all queues, and some of these policies may be overridden at the queue levels). There are other ways to do this. You could just have queues in the system, but configuration and management would probably be harder when compared to a more hierarchical structure. It's fair to argue whether we need any hierarchy at all, and if we do, why just restrict it to Orgs and queues. Why can't queues have sub-queues, for example? My personal view is that these things will become more clear as we start using this system. It intuitively feels right to have Orgs and queues right now, and there are a number of use cases to justify it, some of which I've mentioned in this discussion. But, partly in deference to the fact that it may be too premature to work out the exact details between Orgs, queues, sub-queues, and what not, we'd like to build V1 with one queue per Org, basically equating the two. It's quite possible, and as you point out somewhat, that Orgs manifest themselves more outside the RM (in fact, in the design for V1, we really just deal with queues).
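To make the "Org-level policy, overridable at the queue level" idea above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration only: the class name `PolicyConfig`, the string keys, and the lookup order are not taken from any Hadoop code or from the V1 design.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical policy store: queue-level settings override Org-level defaults. */
class PolicyConfig {
    private final Map<String, String> orgDefaults = new HashMap<>();
    private final Map<String, Map<String, String>> queueOverrides = new HashMap<>();

    /** Sets a policy value that applies to every queue in the Org. */
    void setOrgDefault(String key, String value) {
        orgDefaults.put(key, value);
    }

    /** Sets a policy value for one queue only, shadowing the Org default. */
    void setQueueOverride(String queue, String key, String value) {
        queueOverrides.computeIfAbsent(queue, q -> new HashMap<>()).put(key, value);
    }

    /** Checks the queue's overrides first, then falls back to the Org default. */
    Optional<String> lookup(String queue, String key) {
        Map<String, String> overrides = queueOverrides.get(queue);
        if (overrides != null && overrides.containsKey(key)) {
            return Optional.of(overrides.get(key));
        }
        return Optional.ofNullable(orgDefaults.get(key));
    }
}
```

With one queue per Org, as suggested for V1, the override map would simply stay empty and every lookup would resolve to the Org-level value.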
I expect that over the course of the next many weeks/months, we'll be able to define a more concrete relationship between all these entities. IMO, it's premature to spend too much time on that for V1.
Regarding priorities: queues only decide if they look at job priorities. Jobs always set their own priorities. If a queue respects priorities, it will order jobs based on job priorities. If it does not, it will order jobs based on when they were submitted.
Regarding HA: there are lots of ways to build HA into the system. What we're suggesting for V1 is for the ability to restart the RM and continue from where it left off (i.e., users do not have to resubmit jobs). This adds a level of HA to the existing HOD system today. So rather than say 'HA is a requirement' (which, as I've mentioned earlier, does not buy us anything), we're specifying exactly what is a requirement in V1 to improve the system's HA. As a user, you're guaranteed that your jobs will persist after you submit them. That is the HA-related requirement this system will enforce. Yes, this is technically a guarantee of persistence, but that in turn gets you some level of availability. There is also some level of HA built into the JTs and TTs today, which will continue. If you're concerned whether Req 7.1 technically falls under 'HA', that's a different discussion. I think it does, but I also think it doesn't matter much whether you group it under 'Availability' or 'Persistence' or something else. But, you might differ.
I think I may have taken my main point off track with examples. To give good design feedback we need the base requirements. It looks like you have already moved beyond that stage, so the point is moot.
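The queue ordering rule stated above (order by job priority if the queue respects priorities, otherwise by submission time) could be sketched roughly as follows. The names `QueuedJob` and `QueueOrdering` are made up for this example, the convention that a lower number means a more urgent job is an assumption, and so is breaking priority ties by submission time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical job record; priority 0 is assumed to be the most urgent. */
class QueuedJob {
    final String id;
    final int priority;
    final long submitTime;

    QueuedJob(String id, int priority, long submitTime) {
        this.id = id;
        this.priority = priority;
        this.submitTime = submitTime;
    }
}

class QueueOrdering {
    /** Builds the comparator a queue would use, depending on whether it respects priorities. */
    static Comparator<QueuedJob> comparatorFor(boolean respectsPriorities) {
        Comparator<QueuedJob> bySubmitTime = Comparator.comparingLong(j -> j.submitTime);
        if (!respectsPriorities) {
            return bySubmitTime;
        }
        Comparator<QueuedJob> byPriority = Comparator.comparingInt(j -> j.priority);
        // Assumption: jobs with equal priority run in submission order.
        return byPriority.thenComparing(bySubmitTime);
    }

    /** Returns job ids in the order the queue would consider them. */
    static List<String> order(List<QueuedJob> jobs, boolean respectsPriorities) {
        List<QueuedJob> sorted = new ArrayList<>(jobs);
        sorted.sort(comparatorFor(respectsPriorities));
        List<String> ids = new ArrayList<>();
        for (QueuedJob j : sorted) {
            ids.add(j.id);
        }
        return ids;
    }
}
```

Note that the job still carries a priority in both cases; only the queue decides whether to look at it.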
Regarding Reqs 3.1 and 3.2: In Hadoop today, if a user does not assign a priority to a job when the job is submitted, the system assigns a default job priority. JobConf::getJobPriority() returns JobPriority.NORMAL if no priority has been set in the job's config file. This means that once a job is read into Hadoop, it always has a priority, as it is implemented today. That seems reasonable enough to me. In order not to change things around, the two reqs should be modified as follows:
3.1. Jobs have priorities associated with them (users can optionally assign a priority to a job, or else the system assigns a default priority). For V1, we support the same set of priorities available to MR jobs today.
I realize that we're tweaking a requirement based on current implementation, but the spirit of the requirement was that we continue letting users submit jobs as before, and the system's behavior does not change. I think that is captured better with the newer set of requirements.
Queues are just artifacts of the design of a scheduler. The requirements do not include how jobs/resources are prioritized across queues.
IMO, it might be cleaner to have one scheduler for each org, and the interface of the scheduler consists of two API calls: "scheduleIn(Job, Priority)" and "Job scheduleOut()", where Priority is an opaque object specific to the particular scheduler used by the org. Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler. The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with a FIFO scheduler or a strict priority scheduler and evolve in the long run.
>> Queues are just artifacts of the design of a scheduler.
Maybe I'm nitpicking here, and probably this is not very important, but queues are a requirement in this proposal, and queues, as mentioned in the requirements (which, henceforth, I will enclose in quotes), are different from data-structure queues. A 'queue' is where users submit a job and is a user-facing feature. In the design, you may choose to implement 'queues' with whatever data structure you want. But there is something called a 'queue' which accepts user jobs, which supports access control, etc. Maybe calling it a job repository or some such thing helps? If there's confusion here, maybe it's to do with what people think should go in requirements and what should go in architecture/design?
>> Having one queue, multiple queues, what kind of queues is a decision inside the implementation of a particular type of scheduler.
>> The requirements do not include how jobs/resources are prioritized across queues.
>> The good thing about this design is that it decouples the scheduling from the already ambitious requirements. We can start with FIFO scheduler or strict priority scheduler and evolve in the long run.
I hope I didn't go off on a tangent here.
I think I got your point. Yes, historically, SuperComputing centers do use the name "Queue" to describe "Job repositories". So maybe we should stick with the name "Queue" (and I will always use "Job queue" in the following for clarity). What I describe as "scheduler" would then be the underlying mechanism and policy that arranges jobs in the "Job Queue", and in the scheduler, the implementation may have data-structure queues.
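For reference, the two-call scheduler interface proposed earlier in this thread ("scheduleIn(Job, Priority)" and "Job scheduleOut()") might look something like the sketch below. Only the two method names come from the proposal. The type names `SubmittedJob`, `JobScheduler` and `FifoScheduler`, the use of a generic type parameter for the opaque priority, and returning null when nothing is waiting are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a submitted job. */
class SubmittedJob {
    final String id;

    SubmittedJob(String id) {
        this.id = id;
    }
}

/**
 * The two-call interface from the proposal. P is the scheduler-specific,
 * opaque priority type (modelling it as a type parameter is an assumption).
 */
interface JobScheduler<P> {
    /** Hands a job to the scheduler together with its opaque priority. */
    void scheduleIn(SubmittedJob job, P priority);

    /** Returns the next job to run, or null if nothing is waiting. */
    SubmittedJob scheduleOut();
}

/** Simplest possible policy: ignore the priority and run jobs in arrival order. */
class FifoScheduler implements JobScheduler<Void> {
    // The data-structure queue is an implementation detail hidden behind the interface.
    private final Deque<SubmittedJob> waiting = new ArrayDeque<>();

    @Override
    public void scheduleIn(SubmittedJob job, Void ignoredPriority) {
        waiting.addLast(job);
    }

    @Override
    public SubmittedJob scheduleOut() {
        return waiting.pollFirst();
    }
}
```

A strict priority scheduler would implement the same interface with a different priority type and a different internal structure, which is the decoupling the proposal is after.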
If we support multiple "Job Queues" for each org, the scheduler in each may be oblivious to each other and thus it could be harder to implement intelligent scheduling (due to unpredictable interference). So the reason I see you want to have multiple "Job Queues" is that you can associate access and capacity control at the "Job Queue" level (instead of at the job level). What I am thinking is that we can provide only one "Job Queue" per org, but inside each "Job Queue", we can have job classes, and we associate resource allocation, prioritization of jobs in the same class, and access/capacity control at the job class level. Admittedly, it is just a change of view. If we can clearly define how resources are allocated among different "Job Queues" and use a single scheduler to drive all Job queues in one Org, then we get mostly the same thing...
1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority.
2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back? 3) Is 'N minutes' in 1.6 configurable per org? 4) 8.1: shouldn't the RM be able to allow 20k+ nodes?
>> 1) I would be interested to know why preempting a running job has been excluded from the design. I can imagine use cases (say in a production environment) where a high-priority job should be executed basically instantaneously at the expense of running jobs at lower priority.
You're right - there are legitimate use cases where you want to preempt running jobs for others. But we need to be clear on the exact semantics for preemption, especially since we're doing task level scheduling. What Req 3.3 implies is that if a job with a higher priority comes in while a job with a lower priority is running (i.e., some of its tasks have run, or are running), then going forward, the runnable tasks of the higher priority job will be scheduled before the runnable tasks of the lower priority job. What we will not do is kill the running tasks of the lower priority job (except in the case of Req 1.6). This becomes an issue only if the tasks of the lower priority job are long running. Essentially, what we're trying to limit in V1 is the killing of running tasks. Otherwise, you do get preemption in the sense that tasks of the higher priority jobs run earlier, or to put it more generically, the higher priority job has higher/earlier access to a queue's resources than the lower priority job. The flip side here is that you may end up with a large number of running jobs (which are jobs with at least one task running or having run). This can cause a strain on resources (running jobs use up temp disk space and have a higher memory footprint in today's JT).
>> 2) Will the job scheduler at least be able to kill individual tasks of running jobs to get resources back?
Only to satisfy Req 1.6 for now. It's possible that we kill tasks in more situations, in later versions.
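As a rough illustration of the Req 3.3 semantics described above (higher-priority runnable tasks go first, running tasks are left alone), here is a small sketch. The names `TaskInfo` and `SlotAssigner` and the "lower number is more urgent" convention are assumptions for the example, and the Req 1.6 case where tasks may be killed is deliberately left out.

```java
import java.util.List;

/** Hypothetical view of a task: which job priority it inherits and whether it is running. */
class TaskInfo {
    final String id;
    final int jobPriority; // assumption: 0 is the most urgent
    boolean running;

    TaskInfo(String id, int jobPriority, boolean running) {
        this.id = id;
        this.jobPriority = jobPriority;
        this.running = running;
    }
}

class SlotAssigner {
    /**
     * Picks the task to start when a slot frees up: the runnable (not yet running)
     * task whose job has the highest priority. Running tasks are skipped and never
     * killed here. Returns null if no task is waiting.
     */
    static TaskInfo pickNext(List<TaskInfo> tasks) {
        TaskInfo best = null;
        for (TaskInfo t : tasks) {
            if (t.running) {
                continue; // already holds a slot; left untouched
            }
            if (best == null || t.jobPriority < best.jobPriority) {
                best = t;
            }
        }
        return best;
    }
}
```

The consequence mentioned above follows directly: a long-running low-priority task keeps its slot, so the high-priority job only gets ahead as slots free up.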
>> 3) Is 'N minutes' in 1.6 configurable per org?
Right now, the thinking is to keep things simple (just so we can get something out soon), which means that we're leaning towards global configuration values rather than per Org, in many cases. However, it'd be cool if folks can submit patches to add finer-grained options. Clearly, having a per-Org value of N makes sense. See
>> 4) 8.1: shouldn't the RM be able to allow 20k+ nodes?
The 3K value is for V1. It ties in to the scale we can support with Hadoop in general. Clearly, as HDFS and MR, and Hadoop as a whole, start scaling to more and more nodes, the Resource Manager will have to keep pace. It doesn't make sense to actually implement V1 to handle 20K+ nodes when other Hadoop components do not scale that much. But yes, the RM architecture should be able to handle numbers like 10K or 20K. Maybe that req can be rephrased to say that the 3K number is for now, and we expect it to grow to 10K or 20K soon?
A new requirement is to be added to Version 1:
4.2. Memory limits.
Some recommendations
-keep scheduling independent; make it something that people can experiment with new algorithms, history based scheduling, etc. This is a very hard problem.
-Pay-as-you-go job submission, in which users have a budget of CPU time, has worked well for managing resources since the days of the mainframe. It's a good way of limiting abuse.
-long haul GUI/tools could be useful; assume the client is on a laptop behind a firewall. This could be a good opportunity to make the client-API RESTy, against the web interface only.
Also 11.3 Security: you can only isolate work on the same CPU by running each job in a JVM sandbox with no networking or exec() capabilities. That may be something more broadly useful than just the RM.
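The pay-as-you-go recommendation above could be prototyped with something as small as the ledger below. This is only a sketch under assumptions: the class name `CpuBudgetLedger`, charging an up-front estimate in CPU seconds at submission time, and rejecting a job outright when the budget is too small are all choices made for the example, not something specified in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-user CPU-time budget, charged when a job is submitted. */
class CpuBudgetLedger {
    private final Map<String, Long> remainingCpuSeconds = new HashMap<>();

    /** Adds CPU seconds to a user's budget. */
    void grant(String user, long cpuSeconds) {
        remainingCpuSeconds.merge(user, cpuSeconds, Long::sum);
    }

    /**
     * Accepts the submission and deducts the estimate if the user can afford it;
     * otherwise leaves the budget unchanged and returns false.
     */
    boolean trySubmit(String user, long estimatedCpuSeconds) {
        long remaining = remainingCpuSeconds.getOrDefault(user, 0L);
        if (estimatedCpuSeconds > remaining) {
            return false;
        }
        remainingCpuSeconds.put(user, remaining - estimatedCpuSeconds);
        return true;
    }

    /** Returns the CPU seconds the user still has, or 0 for unknown users. */
    long remaining(String user) {
        return remainingCpuSeconds.getOrDefault(user, 0L);
    }
}
```

A real system would more likely charge for CPU time actually consumed rather than an estimate, but the admission check would have the same shape.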