|
Devaraj Das made changes - 31/Mar/08 01:08 PM
Mukund Madhugiri made changes - 07/Jun/08 01:26 AM
This would indeed be very helpful for utilization, as well as for reducing response time of short jobs. Now that
I don't think we need to change any of the map reduce libraries. The implementation of the scheduler should return a bunch of tasks depending on the current number of free slots the tasktracker has (maybe by running the JobInProgress.obtainNewMap/ReduceTask multiple times). It could optionally decide to do some intelligent assignment of reduces (but that could be scoped for a future jira).
This is a significant problem on larger clusters (I've been testing on a 3500 node cluster) and hence needs to be fixed asap.
Arun C Murthy made changes - 29/Jul/08 05:00 AM
Arun C Murthy made changes - 29/Jul/08 05:00 AM
With the new scheduler code, it should be possible to have a loop in JobQueueTaskScheduler.java to assign multiple tasks (however many map slots and reduce slots are free, all at once), right? I'll do this in the fair scheduler as well.
hey folks - there are some downsides with what's proposed here:
the idea of dispatching multiple scheduling decisions in one shot is cool. but the idea of making multiple scheduling decisions only considering the availability of a single node is not that great. the current scheduler limps along with this badness since it only makes one decision at a time (so what's locally optimal is usually globally optimal - esp. as far as map locality is concerned). but that would increasingly not be the case when multiple decisions need to be made in one shot. here's a slightly different - but even easier way of solving this that avoids this downside:
The reasoning is as follows:
there should not be any impact on JT scalability or overall network traffic. the network and JT should be able to bear traffic that would result from the cluster being 100% busy (and that's what we are emulating here). in addition - where we are adding traffic (cluster idling) - other traffic (DFS/shuffle) traffic would be lower. longer term - we should try to separate scheduling (which should consider global resource availability) from dispatching (which is a per node activity). But this seems like a much bigger problem. The scheduler would need to keep the code that figures out how many slots should be used per a node, so that you don't run with half of your cluster fully loaded and the other empty.
We don't want to increase the heartbeat interval, we need to decrease it. In particular, we currently send a heartbeat when a task finishes. That is very costly and should stop. I should expand my comment. Currently the JT does a #maps/#TT to figure out the load and won't give a TT more than 1 + avg(load) map tasks. (and similarly for reduce) Now that you can have heterogeneous clusters, we should probably change this to:
#maps/#slots = map load which means that you won't load a given node to its capacity unless the cluster is full, but deals with differences in hardware speed. Agree with Owen that it's for the scheduler to decide how many tasks to assign to a TT in one heartbeat. Joydeep, your concerns are very valid - a scheduler should not be dumb and ignore the overall load/state of the system. Otherwise you can end up with lopsided scheduling. And yes, it's fairly hard to decide whether to give the TT more than one task, and if so, what combination of Maps and Reduces from what jobs. But schedulers will need to start dealing with this. Maybe the first step is to assign more than one task only if the system is loaded, and assign tasks from different jobs (as you mention) to spread them around. At some point, a scheduler should also consider the resources available on the TT (mem, CPU) and use that to decide what combination of Map and Reduce slots should run on that node. But, as Owen says, this should be a scheduler decision and we do want to cut down on the heartbeat calls.
i don't know how big a change we are envisaging here. my suggestion was thinking if we wanted to minimize changes to current code base. I don't think the protocol suggested would send any more heartbeats than we do in the worst case today (or that at least it could be tuned to get that property) - but if the plan going forward is to reduce heartbeats - then that's not good enough.
If we are planning non-trivial changes - it seems to me that it would be nice as a first step to have scheduling decisions dispatched pro-actively to slaves on new job arrival (after considering global resource availability). that itself should save on latency of scheduling on an idling cluster. on a backloaded cluster, scheduling reactively to heartbeats seems reasonable (in fact - on a backloaded cluster - it's likely that each heartbeat would only advertize one available slot. unless of course we move away from current protocol for sending heartbeat on task completion). the other significant improvement i can think of is maintaining a runnable task queue on each TT (in addition to the ones running). that way there would be no idle time lost to getting new tasks. the protocol of dispatching tasks to TT runnable queues could happen in the background where one could make optimizations to reduce network traffic. (for example - if the TT had a good sized queue of runnable tasks then it could afford to delay sending heartbeats on task completion (since delaying would no longer cause idle time)). This may also allow the JT to make batched scheduling decisions (instead of immediately looking for something to schedule when slots become free - the JT could wait and accumulate some larger number of free slots before making a global scheduling decision over all those slots and the available tasks. this might make things more globally optimal while still not causing idle time (since there every TT has a queue of runnable tasks)). The runnable task queue on each TT sounds good from the point of view of globally optimal scheduling. The tasktrackers would then require to handle priority/fairness among the multiple jobs. It appears to me that handling fairness/priority is easier at the JT end.
hmm - that would make things very complicated indeed. i really meant a queue - FIFO. the TT would just run serially off the queue subject to available slots. the JT decides the ordering off the queue entirely.
with a single runnable queue - the major downside is that it becomes much harder to quickly respond to high priority tasks. so - one would then have to invent multiple queues (reflecting JT internal data structures) of different priorities. an entirely different line of attack maybe to think about this as a JT scale out (as opposed to performance) problem and figure out how to have multiple JTs. a hierarchical one is easy to think of - there's a master JT and a JT per rack perhaps. there is still some similarity with the previous scheme in that both these levels of trackers would need multiple internal priority queues. but the TT/rack-JT communication would still be high frequency (as today) - in which case - schemes that call for increasing heartbeat rate would be entirely feasible (since there's one JT per rack). Patch for early review while I continue testing; this will conflict with some of the other patches in flight...
Arun C Murthy made changes - 05/Aug/08 09:47 AM
Updated to reflect recent changes to trunk...
Arun C Murthy made changes - 09/Aug/08 07:30 PM
Arun C Murthy made changes - 09/Aug/08 07:30 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12387886/HADOOP-3136_1_20080809.patch against trunk revision 685353. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3046/testReport/ This message is automatically generated. I think that we should give less tasks of a particular type (map/reduce) when the corresponding load of that type is less. So for e.g., if a TT asks for task(s) to run, and at this point of time, we have 200 as the remaining-map-load and 1000 free map slots, we should give out just 1 task to this TT. The logic in the patch would give out more than that (to be precise, it would be the number of available slots on that TT), right? Also, should we consider the "free" slots as opposed to the total number of slots in the calculation for maxMapLoad (or reduceLoad).
I need to track a weird behaviour I see with this patch on GridMix, looks like this patch causes a bug where we hand out more tasks than we should - I need to look closer.
Arun C Murthy made changes - 25/Aug/08 05:05 PM
Updated patch; I ran smalljobsbenchmark on Devaraj's suggestion too and didn't notice any major drawbacks (~5%).
Arun C Murthy made changes - 12/Sep/08 05:25 AM
Arun C Murthy made changes - 12/Sep/08 05:25 AM
Owen pointed out that we can and should decrease the heartbeat interval now, given that we have cut off the 'crazy heartbeats' i.e. heartbeats resulting from a single task's completion.
Arun C Murthy made changes - 15/Sep/08 08:51 AM
Arun, did you monitor how many tasks ended up data-local/rack-local when this feature is enabled ?
I believe it went down from 90ish percent to around 60%. Runping?
Ok, we really need to get better scheduling to go along with assigning multiple tasks – blindly assigning multiple tasks, per the current patch, has a very negative effect on locality of tasks as TaskTrackers 'steal' data-local/rack-local tasks from each other and consequently adversely affects jobs. This patch slowed GridMix down by around 30%.
Here are some hallway ideas we threw around today for better scheduling which is imperative for assigning multiple tasks: 1. Oversubscribe Slots In this approach we need to assign more than one task per slot and queue them up at the TaskTracker. This approach allows for much larger heartbeat intervals (the TaskTracker has queued-up work to finish) and helps with scaling out clusters. 2. Pre-allocation of tasks to TaskTrackers via Global Scheduling Here we build up queues for each TaskTracker (and each Rack) at the JobTracker based on locality. Each task might be on multiple queues depending on which DataNode/TaskTracker it's data is present. When a TaskTracker advertises empty slots we just pick off it's list and assign it. Basically this implies that we consider the 'global' picture during scheduling and ensures that TaskTrackers do not 'steal' data-local tasks from each other. Of course to ensure that the highest-priority job doesn't get starved we need to assign atleast one of it's tasks on each TaskTracker's heartbeat, even if it means we schedule an off-rack task. Similarly, to avoid spreading task-allocation too thin i.e. across too many jobs, we need to ensure that that the TaskTrackers' lists only contain tasks from a reasonably small set of the highest-priority jobs. 3. HADOOP-2014: Scheduling off-rack tasks Thoughts? At the very least, it would help to schedule at least one reduce and one map per heartbeat, because those are pretty independent. This seemed to make a difference in my tests using Gridmix at Facebook because Gridmix would some periods of time when it wouldn't launch reduces at all. Beyond that, for maps, maybe it would also help to limit the number you launch per heartbeat (say to 2), or to ask the job for tasks until it stops giving you local tasks? The current patch seems to go all out asking for tasks from the top job in the queue.
To clarify: In the 'Pre-Allocation' scheme the per-TaskTracker list has tasks sorted by the priority of their jobs. It is ok to not update all lists simultaneously, we just need to mark the TaskInProgress on allocation and they can subsequently be deleted from other lists after checking to ensure they have already been allocated. Similarly, we need to add the TIP back on all lists on task failure. Essentially it moves the current Job-specific caches to a global per-TaskTracker lists.
Clearly we need further thought and discussions along these lines. Should we target anything quick/dirty for 0.19.0? Matei, I agree that assigning atleast one map and one reduce will help - at least in the single reducer case. A good short-term goal.
This does have implications for the removal of the 'crazy heartbeats' (i.e. currently the TaskTrackers send a heartbeat out as soon as any task completes). Maybe we should send a heartbeat immediately when a TaskTracker notices that the last running map completed?
We do need to be slightly careful here: if we do agree that 'global scheduling' is the right approach then I'd rather not introduce short-term hacks to pull tricks like those. I'm hoping to get everyone to think about a long-term solution and keep that in mind while we target short-term fixes, does that make sense? If you guys have a Gridmix cluster lying around, I'm very curious about how well a 1-map/1-reduce policy does and how well a 2-map/1-reduce policy does. Clearly this patch is hard to evaluate without tests on large clusters.
The simplest remedy to the current patch is to assign at most one non-local mapper at each heart beat (if multiple non-local mappers are selected, throw away all but one). Nest refinement will be that, if the current job does not have enough data/rack local mappers for a TT, the job tracker can get some node-local/rack-local mappers from the next few (say 5) jobs, if available. Each heart beat should deliver one reducer to a TT About the crazy heart beat: I think TT can employ some basic intelligence. In general, the Job tracker should tell TT how busy it is. The TT should use that time to decide when to send the next heart beat. Stealing from the lower priority jobs is bad. First it increases the load on the temporary disk space. Second, it hurts
+10 on #2
it's not surprising at all that the initial proposal in this jira makes locality worse. one of our concerns with scheduling right now is the awful locality for small jobs that results from greedy allocation. to that extent - would really welcome any attempt at global scheduling. the queues should also allow JT to make scheduling decisions for a bunch of jobs at one time - instead of for the job as soon as it arrives. (this can be based on how busy the cluster is - with a idle cluster scheduling decisions are better made immediately - with a busy cluster - one can club things together and then decide since the nodes anyway have work to do). this is also an opportunity to attack JT scalability. why keep the queues at the JT only? send them to the TT as well. That way - we don't have to send a heartbeat each time a slot frees up - those can also be batched up by the TT without incurring any idle time (again because of the runnable queues) and should help JT bottlenecks. there are a lot of nuts that can be cracked at once here - so might be worth more thought. What if the heartbeat was really cheap? If all it did was informed the JT of the opportunity to schedule a task to the node and returned task compete info, than I don't think we'd have a problem with a heartbeat per task complete.
I'd advocate this. Decouple scheduling decisions from heartbeats. The JT could the do whatever scheduling it chose and assign new tasks to TT directly when ready. — The problem with assigning many local tasks to a TT on a heartbeat is that you could still end up assigning many tasks to a few nodes and none to others and getting slower execution. We still might want to try that since, the results could still be better than any of our current options and it is simple to code and understand. The suggestion: If (node local tasks available) { I'm sure we could do better, but this is simple and worth trying. — I think it is clear we're going to need a global scheduler that plans the allocation of a jobs tasks to nodes globally and then monitors execution and adjusts the plan. — On reduces, sounds good to allocate those separately.
Robert Chansler made changes - 22/Sep/08 08:03 PM
Ok, with
Meanwhile, given the benefits of assigning multiple tasks, I'd propose we go ahead and do a quick-fix along the lines suggested by Runping/Eric. It clearly is a win over what we have currently. Thoughts?
Arun C Murthy made changes - 23/Sep/08 09:18 PM
+1 on using the heuristic and looking at the scheduling problems in a bigger context.
+1 on TESTING the heuristic. I'm worried about cases where we over schedule the first half of the cluster and leave the other half idle or remote scheduled.
Eric, regarding the cluster overloading, can't we limit the number of slots we use per node when the cluster is undersubscribed, the same way the current one-task-at-a-time scheduler does? As far as I understand, the current scheduler limits the number of tasks it runs on a node to min (number of slots per node, (total number of tasks) / (total number of slots)). For example, if you have a cluster with 50 machines with 4 slots each (i.e. 200 slots in total), and you submit a job with only 100 maps, it will only launch up to 2 maps on each node.
Hi Matei,
Don't let me discourage you from getting some experimental data on If a job only runs on 1/5 of the nodes on a cluster, you may amortize If you have 1000 maps to run on 1000 nodes, your might like to fill On the other hand, you need to make sure you keep your node and rack E14 Here is a patch which fixes the default scheduler to assign multiple tasks per heartbeat. The patch ensures that no more than one off-rack task is handed per heartbeat, changes the TaskTracker to ensure that it doesn't send a heartbeat on every task completion and reduces the heartbeat interval from a ratio of 1s per 50 trackers to 1s per 100 trackers.
Arun C Murthy made changes - 12/Dec/08 09:44 AM
Why assign an off rack task if you could also assign on rack? Seems like the ability to assign an on rack task indicates that other TT might benefit from having a chance to play as well. Have you tested these two strategies?
It seems like the TT should notify the JT whenever any task fails. A server should be able to deal with thousands of notifications / sec and more complete knowledge should allow better behavior. The problem seems to be that processing these heartbeats is too expensive? This would argue for a more complete JT redesign. We should probably start another thread on that, where the JT plans and queues tasks for task trackers in one thread. Heartbeats would then be very cheap, since it would just be a state update on a few tasks and a dispatch of already queued work.
The strategy is to assign as many node-local and rack-local tasks per heartbeat as available slots, however no more than 1 off-switch task per heartbeat regardless of the number of available slots. To prevent under-utilization of the TaskTracker (which might have got only one off-switch task) this patch halves the heartbeat interval (which we can afford now, given that this patch removes the heartbeat on every task completion).
Agreed, we would need to implement finer-grained locking in the JT to reduce the cost of heartbeat processing - currently each heartbeat locks up the entire JT (ala the BFLock in the Linux 2.2 kernel... smile). HADOOP-869 tracks this.
+1. I'll open another jira for global scheduling. Fixed some test cases which assumed only one task was being assigned per heartbeat... I've changed TestJobTrackerRestart.testtestRecoveryWithMultipleJobs to ignore the finish times as they are unpredicatable with fast-start of tasks now.
Arun C Murthy made changes - 13/Dec/08 12:48 AM
Arun C Murthy made changes - 13/Dec/08 12:48 AM
I guess I am too late in commenting on this. But one thing that might be worth doing is to go and ask for a bunch of tasks as soon as the TT empties its queue of tasks (TaskTracker.TaskLauncher.tasksToLaunch is the queue). We batch the request for new tasks on the basis that the TT started running all the queued tasks and we should now backfill the queue. That way, the TT would never really be idle. This might be important for GridMix kind of applications where there are many small tasks...
I had one comment.. is it possible to change the "findLocalTask" parameter in JobInProgress.findNewMapTask from a boolean to an int maxLevel for the maximum level in the topology where you can look? This would be useful for being able to search for data-local tasks before rack-local tasks. Now data vs rack locality shouldn't make much of a difference in most workloads, but I had some tests at Facebook where data-local tasks seemed to be faster, and this is a small change that will leave this possibility open.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12395984/HADOOP-3136_4_20081212.patch against trunk revision 726129. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3738/testReport/ This message is automatically generated. I need to check why TestJobInProgress hangs with this patch...
Arun C Murthy made changes - 15/Dec/08 10:40 AM
Nigel Daley made changes - 15/Dec/08 07:51 PM
Fixed the problem with the test case.
Arun C Murthy made changes - 15/Dec/08 11:36 PM
All tests pass except TestJobTrackerRestart which is tracked by
Arun C Murthy made changes - 16/Dec/08 09:32 AM
Arun C Murthy made changes - 16/Dec/08 09:56 AM
Nigel Daley made changes - 23/Apr/09 07:17 PM
Owen O'Malley made changes - 08/Jul/09 04:52 PM
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. This is a utilization bottleneck, especially when the TaskTracker just starts up. We should be assigning atleast 50% of it's capacity.
2. If the individual tasks are very short i.e. run for less than the heartbeat interval the TaskTracker serially runs one task at a time.
3. For jobs with small maps, the TaskTracker never gets a chance to schedule reduces till all maps are complete. This means shuffle doesn't overlap with maps at all, another sore-point.
Overall, the right approach is to let the TaskTracker advertise the number of available map and reduce slots in each heartbeat and the JobTracker (i.e the Scheduler -
HADOOP-3412/HADOOP-3445) should decide how many tasks and which maps/reduces the TaskTracker should be assigned. Also, we should ensure that the TaskTracker doesn't run to the JobTracker every-time a task completes - maybe we should hard-limit to the heartbeat interval or maybe run to the JobTracker when there are more than one completed tasks in a given heartbeat interval etc.