Nice job, Tom. It's a lot cleaner and will make it easier for us to merge stuff from
HADOOP-3445. I did have a couple of comments:
A Scheduler is usually just an algorithm for deciding which task to assign to a TT. It takes information from the TT (its hostname, rack, what it's currently running, how many resources it has free) and information about what is required (which job/queue to look at, capacities, user limits), and makes the best match. I'm wondering whether TaskScheduler should handle the addition and removal of jobs.

As we make the JT persistent, we need the ability to persist job state to disk, to initialize jobs dynamically (expanding their task-based structures, as in JobInProgress::Init(), only when absolutely needed — in order to scale, you don't want jobs to be expanded otherwise), and to keep only the relevant information in memory with the rest on disk. Something else should likely do all of this, not TaskScheduler. TaskScheduler needs to access the collection of jobs when it runs its scheduling algorithms, but it should not be responsible for managing them. Methods like addJob() and removeJob() probably belong in some other class, something like a JobQueueManager, which, by the way, could also handle multiple queues of jobs, as we'll need for 3445. Maybe the JT itself can manage the queues of jobs initially. Either way, do you think TaskScheduler should be responsible for jobs?
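To make the separation concrete, here is a rough Java sketch of the split I have in mind. JobQueueManager is the hypothetical name from above, Job stands in for JobInProgress, and none of this is the actual Hadoop API — it only illustrates the ownership boundary: the manager owns the job lifecycle, the scheduler only reads the collection.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Hadoop's JobInProgress; fields trimmed for illustration.
class Job {
    final String id;
    Job(String id) { this.id = id; }
}

// Owns the job lifecycle: addition, removal, and eventually persistence
// and lazy initialization. Hypothetical class, named per the discussion.
class JobQueueManager {
    private final Map<String, Job> jobs = new LinkedHashMap<>();
    void addJob(Job job) { jobs.put(job.id, job); }
    void removeJob(String id) { jobs.remove(id); }
    Collection<Job> getJobs() { return jobs.values(); }
}

// The scheduler is handed the manager and only reads from it;
// it never adds or removes jobs itself.
abstract class TaskScheduler {
    protected final JobQueueManager jobManager;
    TaskScheduler(JobQueueManager jobManager) { this.jobManager = jobManager; }
    // Pick a job to serve a free tasktracker slot; null if nothing runnable.
    abstract Job assignTask(String trackerName);
}

// Trivial FIFO policy: earliest-submitted job wins.
class FifoScheduler extends TaskScheduler {
    FifoScheduler(JobQueueManager m) { super(m); }
    @Override Job assignTask(String trackerName) {
        for (Job j : jobManager.getJobs()) return j;
        return null;
    }
}
```

The point of the sketch is that swapping the scheduling policy (the TaskScheduler subclass) never touches job lifecycle code, and persistence work on JobQueueManager never touches scheduling.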
Another thought I had is regarding the work we're doing for 3445. It's more of an observation than a suggestion.
HADOOP-3421 introduces multiple queues, capacity per queue, and user limits. Each of these features affects the scheduling of tasks, which would likely go something like this:
1. A TT indicates that it has one or more free Map or Reduce slots.
2. The JT figures out whether to look for a Map or a Reduce task.
3. The JT needs to find a task in a job in a queue, so it first picks which queue to consider, primarily based on queue capacities and whether the queue can accept a TT slot.
4. Within the selected queue, the JT picks which job to look at, based on priorities and user limits (the job needs to belong to a user who is not using more capacity than he/she is allowed).
5. Finally, within a job, the JT (actually, the JobInProgress object) picks which task to run, based on data locality, speculation, and some other heuristics.
My guess is that many developers will want to plug in tweaks to individual pieces of this scheduling algorithm without modifying the rest of the logic. Someone, for example, may want to tweak how a queue is chosen and not touch the other stuff. For this, IMO, we need to break the JT's scheduling flow (decide on a Map or Reduce task, then pick a queue, then pick a job, then pick a task), which now sits in JobQueueTaskScheduler, into discrete units and allow folks to override one or more of these units. There are a few ways to do this, and we can reuse some of the suggestions and principles applied in this patch. I guess I'm really making a plug here for folks to look at extending the 3445 work (once it is ready) in a similar way, so that the entire scheduling flow can be made extensible at each step.
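One way the flow above could be broken into overridable units is a template-method shape: the overall order (queue, then job, then task) stays fixed, but each step can be overridden independently. All class and method names below are illustrative, not actual Hadoop APIs, and the step bodies are stubs:

```java
import java.util.List;

// Hypothetical sketch: the scheduling flow is fixed in assignTask(),
// while each step is a separate overridable hook.
abstract class MultiQueueScheduler {
    // Fixed flow: pick a queue, then a job within it, then a task.
    final String assignTask(List<String> queues) {
        String queue = pickQueue(queues);
        if (queue == null) return null;
        String job = pickJob(queue);
        if (job == null) return null;
        return pickTask(job);
    }
    // Default queue choice; a real version would weigh capacities
    // and whether the queue can accept the TT slot.
    protected String pickQueue(List<String> queues) {
        return queues.isEmpty() ? null : queues.get(0);
    }
    // Real versions would apply user limits/priorities and locality.
    protected abstract String pickJob(String queue);
    protected abstract String pickTask(String job);
}

// Baseline scheduler with stub job/task selection.
class DemoScheduler extends MultiQueueScheduler {
    protected String pickJob(String queue) { return queue + "/job_1"; }
    protected String pickTask(String job) { return job + "/map_0"; }
}

// Tweaks ONLY the queue-selection step; everything else is inherited.
class LastQueueScheduler extends DemoScheduler {
    @Override protected String pickQueue(List<String> queues) {
        return queues.isEmpty() ? null : queues.get(queues.size() - 1);
    }
}
```

LastQueueScheduler shows the property argued for above: someone who wants to change how a queue is chosen overrides one method and never touches the job- or task-selection logic.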
Once again, I don't think merging the 3445 work with this patch should be very hard, especially with the latest patch. I'll take a look at that soon. Thanks, Brice, for your offer to help; I'm sure you can help us out there. And of course, I'm glad you took this whole effort up and pushed it all the way to where it is. Nice work.