A much better model is for the scheduler to pick specific TaskTrackers and reserve slots on them, while charging those slots against the HighRAMJob and its queue. This would mean that once slots have been reserved for each task of the HighRAMJob, the other slots in the cluster can be handed out to other jobs/queues.
Once the accounting for reserved slots is fixed, it automatically ensures that a HighRAMJob can only reserve slots up to the quota of the queue it belongs to. Hence the next enhancement is to pick specific slots and hold them, rather than holding slots on every TaskTracker.
Picking slots for High RAM Jobs
The key to better support for HighRAMJobs is to reserve slots on specific TaskTrackers. Of course, one could get arbitrarily clever while picking slots. The factors to be considered are:
- Locality of input for the specific map-task of the job
- Minimize the expected delay until the slot is freed on a specific TaskTracker
For the first cut, I'd propose we consider only locality and not expected time. Once we fix speculative execution (HADOOP-2141), we will have more of the necessary features to predict expected time etc., hence the pushback.
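The locality-only first cut could be sketched roughly as follows. This is only an illustration: `ReservationPicker`, `Tracker` and `pickTracker` are hypothetical names, not existing Hadoop classes, and the fallback to the first unreserved tracker is an assumption rather than part of the proposal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Hadoop.
final class ReservationPicker {

    /** Minimal view of a TaskTracker for reservation purposes. */
    static final class Tracker {
        final String host;
        boolean reserved; // true once held for some HighRAMJob task

        Tracker(String host) {
            this.host = host;
        }
    }

    /**
     * Picks a tracker to reserve for one map task, using locality only.
     * Prefers an unreserved tracker on a host that holds the task's input;
     * otherwise falls back to the first unreserved tracker (an assumption).
     */
    static Optional<Tracker> pickTracker(List<Tracker> trackers, Set<String> inputHosts) {
        Tracker fallback = null;
        for (Tracker t : trackers) {
            if (t.reserved) {
                continue; // already held for another task
            }
            if (inputHosts.contains(t.host)) {
                return Optional.of(t); // data-local choice
            }
            if (fallback == null) {
                fallback = t;
            }
        }
        return Optional.ofNullable(fallback);
    }

    public static void main(String[] args) {
        List<Tracker> trackers =
            Arrays.asList(new Tracker("tt1"), new Tracker("tt2"), new Tracker("tt3"));
        Optional<Tracker> picked = pickTracker(trackers, Collections.singleton("tt2"));
        picked.ifPresent(t -> {
            t.reserved = true;
            System.out.println("reserved " + t.host);
        });
    }
}
```

Expected time until a slot frees up is deliberately absent here; it could later be added as a second ranking criterion among the data-local candidates.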
Accounting for Reserved Slots
It is critical that we charge the queues of the HighRAMJobs when we hold reserved slots for them, to ensure that they stay under their capacity and can't run away with slots in the cluster. The proposal is to charge jobs/queues immediately when we reserve slots on a TaskTracker (i.e. when the task cannot be run immediately).
While metering HighRAMJobs, it would be incorrect to meter jobs (slot-hours etc.) by equating reserved slots to running slots. The proposal is to meter HighRAMJobs for open-but-held slots and running slots. (Open but held slots are those which are free on the TaskTracker but are being held while more become available for the HighRAMJob's tasks.)
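One way to read the distinction between charging and metering is sketched below. `QueueAccount` and all of its fields and methods are hypothetical; the sketch assumes that every reserved slot counts against queue capacity at once, while only open-but-held and running slots count towards metered usage.

```java
// Hypothetical sketch: class, field and method names are illustrative only.
final class QueueAccount {
    private final int capacity; // slot quota of the queue
    private int reservedBusy;   // reserved, but still occupied by another task
    private int openHeld;       // free on the tracker, held for a HighRAMJob
    private int running;        // running the queue's own tasks

    QueueAccount(int capacity) {
        this.capacity = capacity;
    }

    /** Slots charged against capacity: every reservation counts immediately. */
    int chargedSlots() {
        return reservedBusy + openHeld + running;
    }

    /** Slots metered for usage (slot-hours etc.): reserved-but-busy slots are excluded. */
    int meteredSlots() {
        return openHeld + running;
    }

    /** Reserves n slots that are still in use elsewhere; refuses to exceed capacity. */
    boolean tryReserve(int n) {
        if (chargedSlots() + n > capacity) {
            return false;
        }
        reservedBusy += n;
        return true;
    }

    /** A reserved slot has been vacated and is now held open for the job. */
    void slotFreed() {
        if (reservedBusy == 0) {
            throw new IllegalStateException("no reserved slot is waiting to be freed");
        }
        reservedBusy--;
        openHeld++;
    }

    /** The task starts, turning n open-but-held slots into running slots. */
    void taskStarted(int n) {
        if (n > openHeld) {
            throw new IllegalStateException("not enough held slots");
        }
        openHeld -= n;
        running += n;
    }
}
```

With this split, a queue cannot over-reserve, yet a job is not billed for slots that are still doing work for someone else.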
Notes on Implementation and Challenges
As discussed above, the proposal is to consider just data-locality while reserving slots. Assuming this, there are a couple of implementation choices once we have reserved the slot:
- Proposal1: Hand out the task to the TaskTracker with a directive to start the task only when sufficient slots are freed up for this task.
- Proposal2: Hold the task at the scheduler noting which slot (i.e. TaskTracker) has been reserved for the same.
Under Proposal 1, we would introduce a queue of ready-to-run tasks at the TaskTracker and fill it with the tasks of the HighRAMJobs.
- The primary advantage of taking this route is that it greatly reduces the cost of implementation; it is fairly simple to introduce a WAITING_FOR_SLOT state for the task and have the necessary information at the TaskTracker to launch it at the appropriate time (i.e. when sufficient slots are free).
- Looking ahead, this might be a good start to do more global scheduling across jobs too where we might
- The major problem with this approach is that it touches a fairly sensitive part of code in the current implementation of the framework... it's fairly risky to tweak the TaskTracker code at this point, along with the JVMManager etc.
- We would still need to tweak the JobTracker to handle the WAITING_FOR_SLOT state e.g. ensure the TaskInitializationThread doesn't kill these tasks etc.
- We need to consider how this affects other schedulers (probably will not).
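A rough sketch of the tracker-side queue in Proposal 1 follows. Only the `WAITING_FOR_SLOT` state name comes from the text above; `TrackerSlotQueue`, `Task` and the FIFO launch order are assumptions made for illustration, and this is not the actual TaskTracker code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of Proposal 1; not the actual TaskTracker implementation.
final class TrackerSlotQueue {
    enum State { WAITING_FOR_SLOT, RUNNING }

    static final class Task {
        final String id;
        final int slotsNeeded;
        State state = State.WAITING_FOR_SLOT;

        Task(String id, int slotsNeeded) {
            this.id = id;
            this.slotsNeeded = slotsNeeded;
        }
    }

    private final Deque<Task> waiting = new ArrayDeque<>();
    private int freeSlots;

    TrackerSlotQueue(int freeSlots) {
        this.freeSlots = freeSlots;
    }

    /** Accepts a task from the scheduler; it waits until enough slots are free. */
    void enqueue(Task t) {
        waiting.addLast(t);
    }

    /** Called when one slot is vacated; returns the tasks launched as a result. */
    List<Task> slotFreed() {
        freeSlots++;
        return launchReady();
    }

    /** Launches queued tasks in FIFO order while the head of the queue fits. */
    private List<Task> launchReady() {
        List<Task> launched = new ArrayList<>();
        while (!waiting.isEmpty() && waiting.peekFirst().slotsNeeded <= freeSlots) {
            Task t = waiting.pollFirst();
            freeSlots -= t.slotsNeeded;
            t.state = State.RUNNING;
            launched.add(t);
        }
        return launched;
    }
}
```

Because freed slots accumulate until the head task fits, ordinary single-slot tasks would have to be kept off those slots by the scheduler, which is where the reservation accounting comes in.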
Under Proposal 2, we would start marking slots as reserved (per task, per job) and maintain the information needed to assign the slot to the task when it eventually frees up.
- Simpler since all state management is done centrally.
- Lower risk since all information is maintained in the scheduler.
- Currently the framework isn't set up to maintain this information: we do not have a single place (e.g. a TT class in the JobTracker) to maintain per-tracker information, i.e. reserved slots etc.
- More engineering effort to maintain maps from each TaskTracker to the task it is reserved for, and vice versa.
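The two-way bookkeeping that Proposal 2 requires might look something like the sketch below. `ReservationTable` is a hypothetical name, and the sketch assumes at most one reservation per tracker and one tracker per task to keep the maps simple.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of scheduler-side state for Proposal 2.
final class ReservationTable {
    private final Map<String, String> taskByTracker = new HashMap<>();
    private final Map<String, String> trackerByTask = new HashMap<>();

    /** Reserves the tracker for the task; returns false if either is already bound. */
    boolean reserve(String tracker, String taskId) {
        if (taskByTracker.containsKey(tracker) || trackerByTask.containsKey(taskId)) {
            return false;
        }
        taskByTracker.put(tracker, taskId);
        trackerByTask.put(taskId, tracker);
        return true;
    }

    /** Task the tracker is reserved for, or null if it is not reserved. */
    String taskFor(String tracker) {
        return taskByTracker.get(tracker);
    }

    /** Tracker reserved for the task, or null if the task holds no reservation. */
    String trackerFor(String taskId) {
        return trackerByTask.get(taskId);
    }

    /** Drops the reservation from both maps, e.g. once the task is assigned or its job dies. */
    String release(String tracker) {
        String taskId = taskByTracker.remove(tracker);
        if (taskId != null) {
            trackerByTask.remove(taskId);
        }
        return taskId;
    }
}
```

Keeping both maps behind one class means they can only be updated together, which is most of the extra engineering effort mentioned above.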
- Proposal 1 is preferred, for the attendant benefits and the leverage it gives us going forward (global scheduling etc.)
It is important for users (and queue-admins) to understand that some slots are reserved for HighRAMJobs, which results in fewer running maps/reduces relative to the queue capacities. It would be nice to add reserved slots to the JobTracker/Job UI, and also to the Queue-Info in the Scheduler page.