My comments are more relevant in the context of HADOOP-5199, but I'm adding them here.
Pools may be marked as either FIFO or fair sharing.
There is another kind of pool which is priority based, where jobs are sorted by priorities. Jobs with the same priority are sorted by submission time (rather, start time). This is the kind of pool that the Capacity Scheduler was meant to handle. So maybe you have two pools: fairshare and priority (priority pools have an option to ignore priorities, in which case jobs are sorted only by submission order), or maybe you conceptually have three: fairshare, FIFO (only sorted by submission order), and priority (sorted by priority, then by submission order).
Each pool has a minimum share (guaranteed share) defined in the config. Any excess capacity is divided between pools according to fair sharing, as in the current scheduler.
I'm wondering if there's a way to simplify this a little more. Suppose pools only have a minimum (guaranteed) share, which you can also think of as a pool's fair share. Any excess capacity is simply claimed by pools that are 'running behind'. So, a scheduler would first pick a pool based on how much the pool is being utilized (which depends on how many tasks the pool is running compared with its minimum/fair share). If pools running below capacity do not have a task to run, we'll eventually pick a pool that's running at/above capacity and give the slot to it. That way, you don't need a notion of two numbers for a pool - minimum and fairshare - and you don't need to explicitly compute excess capacity and redistribute it among pools. Does that make sense? Or do you still feel you need the notion of both a min share and a fair share?
Now for the purpose of scheduling, we can have one big list of runnable jobs, each of which has a min share and a fair share.
Rather than having a global list of jobs, how about each pool just maintains its own list of jobs in the sorted order you mention? The scheduler first picks a pool as I detailed in my previous point, then you can pick a job based on the ordering you've mentioned. Sorting all jobs in the cluster is expensive. Sorting jobs in a single pool will be cheaper because there are fewer jobs.