I've been looking at what it will take to extend this from Arun's February sketch to something that will actually schedule and run small jobs. At the moment it looks like it will largely avoid JobTracker; most of the action seems to be in JobInProgress (particularly initTasks() and obtainNew*Task()), TaskInProgress (findNew*Task(), getTaskToRun(), addRunningTask()), and the scheduler (assignTasks()).
I haven't asked Arun what he originally had in mind, but it seems that there are two fairly obvious approaches:
- treat UberTask (or MetaTask or MultiTask or AllInOneTask or ...) as a third kind of Task, not exactly like either MapTask or ReduceTask but a peer (more or less) to both
- treat UberTask as a variant (subclass) of MapTask
I see the first approach as conceptually cleaner, and some of its implementation details would be cleaner as well, but overall it's harder to implement: there are lots of places where there's map-vs-reduce logic (occasionally with special setup/cleanup cases), and most of it would require modifications. The second approach seems slightly hackish, but it has at least that one big advantage: many of the map-vs-reduce bits of code would not require changes - in particular, I don't believe it would be necessary to touch the schedulers (and we're up to, what, four at this point?) since they'd simply see a job containing one map and no reduces.
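To make the trade-off concrete, here is a toy sketch of the second approach. This is not actual Hadoop code: the class names (`Task`, `MapTask`, `UberTask`) are stand-ins for the real ones, chosen only to show the shape of the idea, i.e., that the framework sees one ordinary map, which internally runs the job's real maps and (at most one) reduce serially.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the framework's task abstraction.
abstract class Task {
    abstract List<String> run();                 // returns a log of what ran
}

// Stand-in for an ordinary map task over one input split.
class MapTask extends Task {
    final String split;
    MapTask(String split) { this.split = split; }
    @Override List<String> run() { return List.of("map:" + split); }
}

// UberTask is-a MapTask as far as the framework is concerned, so the
// existing map-vs-reduce code paths (and the schedulers) see a job
// containing one map and no reduces.
class UberTask extends MapTask {
    final List<MapTask> subMaps;
    final boolean hasReduce;
    UberTask(List<MapTask> subMaps, boolean hasReduce) {
        super("uber");
        this.subMaps = subMaps;
        this.hasReduce = hasReduce;
    }
    @Override List<String> run() {
        List<String> log = new ArrayList<>();
        for (MapTask m : subMaps) log.addAll(m.run());  // serial sub-maps
        if (hasReduce) log.add("reduce");               // at most one reduce
        return log;
    }
}
```

The point of the sketch is the inheritance arrow: because the uber task satisfies the map-task contract, everything that dispatches on map-vs-reduce keeps working unmodified, which is exactly the "big advantage" above.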
Thoughts? (I don't yet have a strong opinion myself, though I'm interested in getting a proof of concept running quickly so I/we can see what works and whether it's a net win in the first place; from that perspective, the latter approach may be better.)
Either way, there will be changes needed in the UI, metrics, and other accounting to surface the tasks-within-a-task details, as well as the obvious configuration-related changes. Retry logic, speculation, etc., are still unclear (to me, anyway).
Stab at a couple of the other upstream questions:
Will users be able to set a number of reduce tasks greater than one? Or will they be limited to at most one reduce task?
The latter, at least for now. This is somewhat related to the question of serial execution; i.e., if it's serial, the only reason you would want multiple reduces is if a single one is too big to fit in memory, and if that's the case, it's arguably not a "small job" anymore.
I am still a little dubious about whether this is ambitious enough. In particular: why serial execution? To me, local execution == exploiting the resources available in a single box, which is 8 cores today and on the way up.
Well, generally multiple cores => multiple slots, at which point you let the scheduler figure it out. Chris mentioned that there's a JIRA to extend the LocalJobRunner in this direction (MAPREDUCE-434?), and if the currently proposed version of this one works out, an obvious next step would be to look at similar extensions. But this is still an untested optimization, and all the usual caveats about premature optimization apply.
I don't know how Arun looked at the problem - I'm guessing the motivation was empirical, based on the behavior of Oozie workflows and Pig jobs - but I view it as a way to make task granularity a bit more homogeneous by packing multiple too-small tasks into larger containers. My mental model is that this will tend to make the scheduler more efficient - but also that trying to do too much may begin to work at cross-purposes with the scheduler, i.e., one probably doesn't want to create a secondary scheduler (trying to tune both halves could get very messy).
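The packing idea can be illustrated with a deliberately simple greedy sketch - again, not from any patch, and the helper name (`TaskPacker`) and the runtime-estimate inputs are made up: tasks whose estimated runtimes are small get grouped into batches that approach a target granularity, so the scheduler dispatches a few reasonably sized units instead of many tiny ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: greedily groups task runtime estimates (seconds)
// into batches whose total stays near a target granularity. A real
// implementation would pack by input-split size, locality, etc.
class TaskPacker {
    static List<List<Integer>> pack(List<Integer> taskSecs, int targetSecs) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int sum = 0;
        for (int t : taskSecs) {
            // Start a new batch when adding t would overshoot the target
            // (a batch always accepts at least one task, however large).
            if (!current.isEmpty() && sum + t > targetSecs) {
                batches.add(current);
                current = new ArrayList<>();
                sum = 0;
            }
            current.add(t);
            sum += t;
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

For example, packing runtimes of 5, 5, 5, 20, 5 seconds with a 15-second target yields three batches rather than five scheduling decisions; this is exactly the "homogeneous granularity" intuition, and also hints at why a second, smarter layer of packing logic would start to look like a secondary scheduler.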
It does seem that a dispatcher facility at the lowest layer that can juggle between different Hadoop clusters is useful generically (across Hive/Pig/native-Hadoop etc.) - but that's not quite the same as what's proposed here - is it?
Nope. But I think other groups are looking at stuff like that.