Hadoop Common - HADOOP-3136

Assign multiple tasks per TaskTracker heartbeat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels: None

      Description

      In today's logic of finding a new task, we assign only one task per heartbeat.

      We probably could give the tasktracker multiple tasks, subject to the max number of free slots it has - for maps we could assign it data-local tasks. We could probably run some logic to decide what to give it if we run out of data-local tasks (e.g., tasks from overloaded racks, tasks that have the least locality, etc.). In addition to maps, if it has reduce slots free, we could give it reduce task(s) as well. Again, for reduces we could probably run some logic to give more tasks to nodes that are closer to the nodes running most maps (assuming data generated is proportional to the number of maps). For example, if rack1 has 70% of the input splits, and we know that most maps are data/rack local, we try to schedule ~70% of the reducers there.

      Thoughts?

      1. HADOOP-3136_0_20080805.patch
        8 kB
        Arun C Murthy
      2. HADOOP-3136_1_20080809.patch
        7 kB
        Arun C Murthy
      3. HADOOP-3136_2_20080911.patch
        11 kB
        Arun C Murthy
      4. HADOOP-3136_3_20081211.patch
        24 kB
        Arun C Murthy
      5. HADOOP-3136_4_20081212.patch
        41 kB
        Arun C Murthy
      6. HADOOP-3136_5_20081215.patch
        45 kB
        Arun C Murthy

        Activity

        Arun C Murthy added a comment -

        The problem with today's behaviour is multi-fold:
        1. This is a utilization bottleneck, especially when the TaskTracker just starts up. We should be assigning at least 50% of its capacity.
        2. If the individual tasks are very short, i.e., run for less than the heartbeat interval, the TaskTracker serially runs one task at a time.
        3. For jobs with small maps, the TaskTracker never gets a chance to schedule reduces till all maps are complete. This means shuffle doesn't overlap with maps at all - another sore point.

        Overall, the right approach is to let the TaskTracker advertise the number of available map and reduce slots in each heartbeat, and the JobTracker (i.e., the Scheduler - HADOOP-3412/HADOOP-3445) should decide how many tasks and which maps/reduces the TaskTracker should be assigned. Also, we should ensure that the TaskTracker doesn't run to the JobTracker every time a task completes - maybe we should hard-limit it to the heartbeat interval, or maybe run to the JobTracker only when more than one task has completed in a given heartbeat interval, etc.

        Hemanth Yamijala added a comment -

        HADOOP-657 seems related in the sense that we report free disk space to the JobTracker from the TaskTrackers and help it pick appropriate tasks. So, if we treat all of these as resources - slots, memory, disk space, etc. - maybe we should look at a uniform way of describing them.

        Matei Zaharia added a comment -

        This would indeed be very helpful for utilization, as well as for reducing response time of short jobs. Now that HADOOP-3412 is committed, what needs to be done for this? Is it okay to just return multiple tasks from a scheduler, or is some support needed in the rest of the MapReduce libraries?

        Devaraj Das added a comment -

        I don't think we need to change any of the map reduce libraries. The implementation of the scheduler should return a bunch of tasks depending on the current number of free slots the tasktracker has (maybe by running the JobInProgress.obtainNewMap/ReduceTask multiple times). It could optionally decide to do some intelligent assignment of reduces (but that could be scoped for a future jira).
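
        A minimal sketch of the loop described above, in simplified form - the method names follow this comment, but the signatures are hypothetical (the real JobInProgress.obtainNewMapTask/obtainNewReduceTask take more parameters):

        import java.util.ArrayList;
        import java.util.List;

        // Sketch only: Task and JobInProgress are stand-ins for the real
        // Hadoop classes, reduced to what this idea needs.
        class MultiAssignSketch {
          interface Task {}
          interface JobInProgress {
            Task obtainNewMapTask();     // returns null when no runnable map fits
            Task obtainNewReduceTask();  // likewise for reduces
          }

          static List<Task> assignTasks(JobInProgress job,
                                        int freeMapSlots, int freeReduceSlots) {
            List<Task> assigned = new ArrayList<Task>();
            // Call obtain* once per free slot, as suggested above.
            for (int i = 0; i < freeMapSlots; i++) {
              Task t = job.obtainNewMapTask();
              if (t == null) break;      // the job has no more maps to hand out
              assigned.add(t);
            }
            for (int i = 0; i < freeReduceSlots; i++) {
              Task t = job.obtainNewReduceTask();
              if (t == null) break;
              assigned.add(t);
            }
            return assigned;
          }
        }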

        Arun C Murthy added a comment -

        This is a significant problem on larger clusters (I've been testing on a 3500-node cluster) and hence needs to be fixed ASAP.

        Matei Zaharia added a comment -

        With the new scheduler code, it should be possible to have a loop in JobQueueTaskScheduler.java to assign multiple tasks (however many map slots and reduce slots are free, all at once), right? I'll do this in the fair scheduler as well.

        Joydeep Sen Sarma added a comment -

        hey folks - there are some downsides with what's proposed here:

        • it's likely that on an idle cluster with a series of new jobs - each new job would end up on a subset of nodes (the first nodes to send heartbeats after job arrival). this is generally speaking bad:
        • jobs have different characteristics and it's good to spread them around (esp. pertinent to memory usage)
        • reduce tasks have bursty network/disk traffic pattern and in general scheduling multiple of them on the same node is bad
        • map locality could get worse (would we schedule non-local jobs even if a node advertises availability of multiple slots? this is a very hard problem to solve)

        the idea of dispatching multiple scheduling decisions in one shot is cool. but the idea of making multiple scheduling decisions only considering the availability of a single node is not that great. the current scheduler limps along with this badness since it only makes one decision at a time (so what's locally optimal is usually globally optimal - esp. as far as map locality is concerned). but that would increasingly not be the case when multiple decisions need to be made in one shot.

        here's a slightly different - but even easier way of solving this that avoids this downside:

        • make the heartbeat interval from a TT inversely proportional to the number of idle slots on the TT.
        • the exact formula is approximately: send one heartbeat every S/I seconds on behalf of the idle slots from each TT (where I is the number of idle slots and S is the average task completion time, either parameterized or observed).

        The reasoning is as follows:

        • in a busy cluster (lots of tasks in and out) - there is no problem with delays in dispatching tasks. And we know that scheduling decisions are reasonably satisfactory today (or at least that is an issue outside of the scope of this Jira).
        • in a busy cluster - if the average task completion time is S and the total number of slots per TT is N - then on average each TT sends a heartbeat every S/N seconds
        • so the idea is to make an idle cluster mimic a busy cluster as far as TT/JT communication goes. We do this by pretending that the idle slots are actually busy - and thereby sending heartbeats every S/I seconds on behalf of the idle slots.

        there should not be any impact on JT scalability or overall network traffic. the network and JT should be able to bear the traffic that would result from the cluster being 100% busy (and that's what we are emulating here). in addition - where we are adding traffic (cluster idling) - other (DFS/shuffle) traffic would be lower.

        longer term - we should try to separate scheduling (which should consider global resource availability) from dispatching (which is a per node activity). But this seems like a much bigger problem.
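
        A back-of-the-envelope sketch of the proposed formula, with S and I as defined above (the class and the numbers in main are made up for illustration):

        // Joydeep's proposal: with I idle slots and average task completion
        // time S seconds, an idle TT heartbeats once every S/I seconds - the
        // same aggregate rate a fully busy TT would produce.
        class HeartbeatIntervalSketch {
          static double intervalSeconds(double avgTaskSeconds, int idleSlots) {
            if (idleSlots <= 0) return avgTaskSeconds; // no idle slots: base rate
            return avgTaskSeconds / idleSlots;
          }

          public static void main(String[] args) {
            // e.g. S = 30s average task time, I = 4 idle slots
            // -> one heartbeat every 7.5 seconds
            System.out.println(intervalSeconds(30.0, 4));
          }
        }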

        Owen O'Malley added a comment -

        The scheduler would need to keep the code that figures out how many slots should be used per node, so that you don't run with half of your cluster fully loaded and the other half empty.

        We don't want to increase the heartbeat interval, we need to decrease it. In particular, we currently send a heartbeat when a task finishes. That is very costly and should stop.

        Owen O'Malley added a comment -

        I should expand my comment. Currently the JT computes #maps/#TT to figure out the load and won't give a TT more than 1 + avg(load) map tasks (and similarly for reduces). Now that you can have heterogeneous clusters, we should probably change this to:

        map load = #maps / #slots
        max usable slots for a given TT = (#slots on that TT) * min(1.0, load) + 1

        which means that you won't load a given node to its capacity unless the cluster is full, but deals with differences in hardware speed.
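
        A small, self-contained sketch of that calculation (the names are illustrative, not from any patch):

        // load = #maps / #slots cluster-wide; a TT may use at most
        // (#slots on that TT) * min(1.0, load) + 1 map slots.
        class SlotCapSketch {
          static int maxUsableSlots(int runnableMaps, int clusterSlots,
                                    int trackerSlots) {
            double load = (double) runnableMaps / clusterSlots;
            return (int) (trackerSlots * Math.min(1.0, load)) + 1;
          }

          public static void main(String[] args) {
            // 100 runnable maps on 200 cluster slots, a 4-slot TT:
            // 4 * min(1.0, 0.5) + 1 = 3 usable slots
            System.out.println(maxUsableSlots(100, 200, 4));
          }
        }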

        Vivek Ratan added a comment -

        Agree with Owen that it's for the scheduler to decide how many tasks to assign to a TT in one heartbeat. Joydeep, your concerns are very valid - a scheduler should not be dumb and ignore the overall load/state of the system. Otherwise you can end up with lopsided scheduling. And yes, it's fairly hard to decide whether to give the TT more than one task, and if so, what combination of Maps and Reduces from what jobs. But schedulers will need to start dealing with this. Maybe the first step is to assign more than one task only if the system is loaded, and assign tasks from different jobs (as you mention) to spread them around. At some point, a scheduler should also consider the resources available on the TT (mem, CPU) and use that to decide what combination of Map and Reduce slots should run on that node. But, as Owen says, this should be a scheduler decision and we do want to cut down on the heartbeat calls.

        Joydeep Sen Sarma added a comment -

        i don't know how big a change we are envisaging here. my suggestion was made thinking we wanted to minimize changes to the current code base. I don't think the protocol suggested would send any more heartbeats than we do in the worst case today (or at least it could be tuned to get that property) - but if the plan going forward is to reduce heartbeats - then that's not good enough.

        If we are planning non-trivial changes - it seems to me that it would be nice as a first step to have scheduling decisions dispatched pro-actively to slaves on new job arrival (after considering global resource availability). that itself should save on latency of scheduling on an idling cluster. on a backloaded cluster, scheduling reactively to heartbeats seems reasonable (in fact - on a backloaded cluster - it's likely that each heartbeat would only advertise one available slot. unless of course we move away from the current protocol of sending a heartbeat on task completion).

        the other significant improvement i can think of is maintaining a runnable task queue on each TT (in addition to the ones running). that way there would be no idle time lost to getting new tasks. the protocol of dispatching tasks to TT runnable queues could happen in the background, where one could make optimizations to reduce network traffic (for example - if the TT had a good-sized queue of runnable tasks then it could afford to delay sending heartbeats on task completion (since delaying would no longer cause idle time)). This may also allow the JT to make batched scheduling decisions (instead of immediately looking for something to schedule when slots become free - the JT could wait and accumulate some larger number of free slots before making a global scheduling decision over all those slots and the available tasks. this might make things more globally optimal while still not causing idle time (since every TT has a queue of runnable tasks)).
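
        A minimal sketch of the per-TT runnable queue idea (entirely hypothetical - nothing like this exists in the code being discussed):

        import java.util.ArrayDeque;
        import java.util.Queue;

        // The JT pushes scheduling decisions into a FIFO on the TT; the TT
        // drains it as slots free up, so no idle time is spent waiting for
        // the next heartbeat round-trip.
        class RunnableQueueSketch {
          interface Task { void launch(); }

          private final Queue<Task> runnable = new ArrayDeque<Task>();

          synchronized void enqueue(Task t) {   // called on JT dispatch
            runnable.add(t);
          }

          synchronized void onSlotFreed() {     // called on task completion
            Task next = runnable.poll();
            if (next != null) next.launch();    // start queued work immediately
          }
        }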

        Devaraj Das added a comment -

        The runnable task queue on each TT sounds good from the point of view of globally optimal scheduling. The tasktrackers would then need to handle priority/fairness among the multiple jobs. It appears to me that handling fairness/priority is easier at the JT end.

        Joydeep Sen Sarma added a comment -

        hmm - that would make things very complicated indeed. i really meant a queue - FIFO. the TT would just run serially off the queue subject to available slots. the JT decides the ordering off the queue entirely.

        with a single runnable queue - the major downside is that it becomes much harder to quickly respond to high priority tasks. so - one would then have to invent multiple queues (reflecting JT internal data structures) of different priorities.

        an entirely different line of attack may be to think about this as a JT scale-out (as opposed to performance) problem and figure out how to have multiple JTs. a hierarchical one is easy to think of - there's a master JT and a JT per rack perhaps. there is still some similarity with the previous scheme in that both these levels of trackers would need multiple internal priority queues. but the TT/rack-JT communication would still be high frequency (as today) - in which case - schemes that call for increasing the heartbeat rate would be entirely feasible (since there's one JT per rack).

        Arun C Murthy added a comment -

        Patch for early review while I continue testing; this will conflict with some of the other patches in flight...

        Arun C Murthy added a comment -

        Updated to reflect recent changes to trunk...

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12387886/HADOOP-3136_1_20080809.patch
        against trunk revision 685353.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3046/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3046/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3046/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3046/console

        This message is automatically generated.

        Devaraj Das added a comment -

        I think that we should give fewer tasks of a particular type (map/reduce) when the corresponding load of that type is less. So, for example, if a TT asks for task(s) to run, and at this point in time we have 200 as the remaining-map-load and 1000 free map slots, we should give out just 1 task to this TT. The logic in the patch would give out more than that (to be precise, it would be the number of available slots on that TT), right? Also, should we consider the "free" slots as opposed to the total number of slots in the calculation for maxMapLoad (or reduceLoad)?

        Arun C Murthy added a comment -

        I need to track down a weird behaviour I see with this patch on GridMix; it looks like this patch causes a bug where we hand out more tasks than we should - I need to look closer.

        Arun C Murthy added a comment -

        Updated patch; I ran smalljobsbenchmark on Devaraj's suggestion too and didn't notice any major drawbacks (~5%).

        Arun C Murthy added a comment -

        Owen pointed out that we can and should decrease the heartbeat interval now, given that we have cut off the 'crazy heartbeats' i.e. heartbeats resulting from a single task's completion.

        Devaraj Das added a comment -

        Arun, did you monitor how many tasks ended up data-local/rack-local when this feature is enabled?

        Arun C Murthy added a comment -

        I believe it went down from 90ish percent to around 60%. Runping?

        Arun C Murthy added a comment -

        Ok, we really need to get better scheduling to go along with assigning multiple tasks – blindly assigning multiple tasks, per the current patch, has a very negative effect on locality of tasks as TaskTrackers 'steal' data-local/rack-local tasks from each other and consequently adversely affects jobs. This patch slowed GridMix down by around 30%.

        Here are some hallway ideas we threw around today for better scheduling which is imperative for assigning multiple tasks:

        1. Oversubscribe Slots

        In this approach we need to assign more than one task per slot and queue them up at the TaskTracker. This approach allows for much larger heartbeat intervals (the TaskTracker has queued-up work to finish) and helps with scaling out clusters.

        2. Pre-allocation of tasks to TaskTrackers via Global Scheduling

        Here we build up queues for each TaskTracker (and each Rack) at the JobTracker based on locality. Each task might be on multiple queues depending on which DataNodes/TaskTrackers its data is present on. When a TaskTracker advertises empty slots we just pick off its list and assign them. Basically this implies that we consider the 'global' picture during scheduling and ensures that TaskTrackers do not 'steal' data-local tasks from each other.

        Of course, to ensure that the highest-priority job doesn't get starved, we need to assign at least one of its tasks on each TaskTracker's heartbeat, even if it means we schedule an off-rack task. Similarly, to avoid spreading task-allocation too thin, i.e., across too many jobs, we need to ensure that the TaskTrackers' lists only contain tasks from a reasonably small set of the highest-priority jobs.

        3. HADOOP-2014: Scheduling off-rack tasks


        Thoughts?

        Matei Zaharia added a comment -

        At the very least, it would help to schedule at least one reduce and one map per heartbeat, because those are pretty independent. This seemed to make a difference in my tests using Gridmix at Facebook, because Gridmix would have some periods of time when it wouldn't launch reduces at all. Beyond that, for maps, maybe it would also help to limit the number you launch per heartbeat (say to 2), or to ask the job for tasks until it stops giving you local tasks? The current patch seems to go all out asking for tasks from the top job in the queue.

        Arun C Murthy added a comment -

        To clarify: in the 'Pre-Allocation' scheme the per-TaskTracker list has tasks sorted by the priority of their jobs. It is ok to not update all lists simultaneously; we just need to mark the TaskInProgress on allocation, and they can subsequently be deleted from other lists after checking to ensure they have already been allocated. Similarly, we need to add the TIP back on all lists on task failure. Essentially it moves the current Job-specific caches to global per-TaskTracker lists.

        Clearly we need further thought and discussions along these lines. Should we target anything quick/dirty for 0.19.0?

        Arun C Murthy added a comment -

        Matei, I agree that assigning at least one map and one reduce will help - at least in the single-reducer case. A good short-term goal.

        Beyond that, for maps, maybe it would also help to limit the number you launch per heartbeat (say to 2)

        This does have implications for the removal of the 'crazy heartbeats' (i.e. currently the TaskTrackers send a heartbeat out as soon as any task completes). Maybe we should send a heartbeat immediately when a TaskTracker notices that the last running map completed?

        [...] or to ask the job for tasks until it stops giving you local tasks

        We do need to be slightly careful here: if we do agree that 'global scheduling' is the right approach then I'd rather not introduce short-term hacks to pull tricks like those. I'm hoping to get everyone to think about a long-term solution and keep that in mind while we target short-term fixes, does that make sense?

        Matei Zaharia added a comment -

        If you guys have a Gridmix cluster lying around, I'm very curious about how well a 1-map/1-reduce policy does and how well a 2-map/1-reduce policy does. Clearly this patch is hard to evaluate without tests on large clusters.

        Runping Qi added a comment -

        The simplest remedy to the current patch is to assign at most one non-local mapper at each heartbeat (if multiple non-local mappers are selected, throw away all but one).

        The next refinement will be that, if the current job does not have enough data-/rack-local mappers for a TT, the job tracker can get some node-local/rack-local mappers from the next few (say 5) jobs, if available.

        Each heartbeat should deliver one reducer to a TT.

        Runping Qi added a comment -

        About the crazy heartbeat: I think the TT can employ some basic intelligence.
        For example, the TT can decide not to send a crazy heartbeat if it has a few tasks running.

        In general, the JobTracker should tell the TT how busy it is. The TT should use that to decide when to send the next heartbeat.

        Owen O'Malley added a comment -

        Stealing from the lower priority jobs is bad. First it increases the load on the temporary disk space. Second, it hurts HADOOP-249, which will only reuse jvms from the same job. I'm ok with the heuristic of only assigning local tasks after the first of each kind.

        Joydeep Sen Sarma added a comment -

        +10 on #2

        it's not surprising at all that the initial proposal in this jira makes locality worse. one of our concerns with scheduling right now is the awful locality for small jobs that results from greedy allocation. to that extent - would really welcome any attempt at global scheduling.
        the runnable queue per TT sounds very much like https://issues.apache.org/jira/browse/HADOOP-3136?focusedCommentId=12618922#action_12618922

        the queues should also allow the JT to make scheduling decisions for a bunch of jobs at one time - instead of for each job as soon as it arrives. (this can be based on how busy the cluster is - with an idle cluster scheduling decisions are better made immediately - with a busy cluster - one can club things together and then decide, since the nodes anyway have work to do).

        this is also an opportunity to attack JT scalability. why keep the queues at the JT only? send them to the TT as well. That way - we don't have to send a heartbeat each time a slot frees up - those can also be batched up by the TT without incurring any idle time (again because of the runnable queues) and should help JT bottlenecks.

        there are a lot of nuts that can be cracked at once here - so might be worth more thought.

        eric baldeschwieler added a comment -

        What if the heartbeat was really cheap? If all it did was inform the JT of the opportunity to schedule a task to the node and return task-complete info, then I don't think we'd have a problem with a heartbeat per task completion.

        I'd advocate this. Decouple scheduling decisions from heartbeats. The JT could then do whatever scheduling it chose and assign new tasks to TTs directly when ready.

        The problem with assigning many local tasks to a TT on a heartbeat is that you could still end up assigning many tasks to a few nodes and none to others, and getting slower execution. We still might want to try that since the results could still be better than any of our current options, and it is simple to code and understand.

        The suggestion:

        if (node-local tasks available) {
            assign as many node-local tasks as available to the TT
        } else if (switch-local task is available) {
            assign one switch-local task
        } else {
            assign one remote task
        }

        I'm sure we could do better, but this is simple and worth trying.

        I think it is clear we're going to need a global scheduler that plans the allocation of a job's tasks to nodes globally and then monitors execution and adjusts the plan.

        On reduces, sounds good to allocate those separately.

        Robert Chansler added a comment -

        Not for 0.19.

        Arun C Murthy added a comment -

        Ok, with HADOOP-249 getting in, it's imperative for us to consider JVM reuse in the scheduling algorithms too - it really behooves us to think much harder about the global scheduling problem, and it probably requires a prototype or two to really get under the skin of the beast. I'll go ahead and open a new jira for that.

        Meanwhile, given the benefits of assigning multiple tasks, I'd propose we go ahead and do a quick-fix along the lines suggested by Runping/Eric. It clearly is a win over what we have currently. Thoughts?

        Owen O'Malley added a comment -

        +1 on using the heuristic and looking at the scheduling problems in a bigger context.

        eric baldeschwieler added a comment -

        +1 on TESTING the heuristic. I'm worried about cases where we over-schedule the first half of the cluster and leave the other half idle or remote-scheduled.

        Matei Zaharia added a comment -

        Eric, regarding the cluster overloading, can't we limit the number of slots we use per node when the cluster is undersubscribed, the same way the current one-task-at-a-time scheduler does? As far as I understand, the current scheduler limits the number of tasks it runs on a node to min (number of slots per node, (total number of tasks) / (total number of slots)). For example, if you have a cluster with 50 machines with 4 slots each (i.e. 200 slots in total), and you submit a job with only 100 maps, it will only launch up to 2 maps on each node.

        eric baldeschwieler added a comment -

        Hi Matei,

        Don't let me discourage you from getting some experimental data on
        this! I'm sure we can find some simple heuristics that are better
        than what we have now. Even on a loaded cluster, the right decision
        per node may require more global information. For example, is it a
        good strategy to run all the tasks for the first job in the queue on a
        subset of the nodes in the cluster? Maybe it is. But you probably
        want to be thinking a bit about what the shuffle phase is going to
        look like. You may not want very unbalanced racks / nodes. Or in
        some cases perhaps that unbalance is optimal!

        If a job only runs on 1/5 of the nodes on a cluster, you may amortize
        a lot of startup costs that you pay per node over more tasks / node.
        This might be really good! Costs to copy jars and start VMs for
        example can be significant.

        If you have 1000 maps to run on 1000 nodes, you might like to fill
        every map slot on a subset of the nodes with at least 2 generations of
        maps, rather than running one map / node. This could give you much
        better throughput. This would support your idea.

        On the other hand, you need to make sure you keep your node and rack
        locality high, or that strategy might work really badly. One could
        imagine your total locality going down a lot for your tail jobs, as we
        just observed. Also, you might create shuffle hotspots, which could
        kill your HDFS performance and slow down total throughput...

        E14

        Arun C Murthy added a comment -

        Here is a patch which fixes the default scheduler to assign multiple tasks per heartbeat. The patch ensures that no more than one off-rack task is handed per heartbeat, changes the TaskTracker to ensure that it doesn't send a heartbeat on every task completion and reduces the heartbeat interval from a ratio of 1s per 50 trackers to 1s per 100 trackers.

        eric baldeschwieler added a comment -

        Why assign an off-rack task if you could also assign an on-rack one? Seems like the ability to assign an on-rack task indicates that other TTs might benefit from having a chance to play as well. Have you tested these two strategies?

        It seems like the TT should notify the JT whenever any task finishes. A server should be able to deal with thousands of notifications/sec, and more complete knowledge should allow better behavior. The problem seems to be that processing these heartbeats is too expensive? This would argue for a more complete JT redesign. We should probably start another thread on that, where the JT plans and queues tasks for task trackers in one thread. Heartbeats would then be very cheap, since each would just be a state update on a few tasks and a dispatch of already-queued work.

        Arun C Murthy added a comment -

        Why assign an off rack task if you could also assign on rack?

        The strategy is to assign as many node-local and rack-local tasks per heartbeat as available slots, however no more than 1 off-switch task per heartbeat regardless of the number of available slots. To prevent under-utilization of the TaskTracker (which might have got only one off-switch task) this patch halves the heartbeat interval (which we can afford now, given that this patch removes the heartbeat on every task completion).
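
        A sketch of that per-heartbeat policy (the helper names are hypothetical, not the patch's actual code):

        import java.util.ArrayList;
        import java.util.List;

        // Fill free map slots with node-/rack-local tasks, but hand out at
        // most one off-switch task per heartbeat, whatever the slot count.
        class OffSwitchCapSketch {
          interface Task {}
          interface TaskSource {
            Task obtainLocalMap();      // node-local or rack-local map, else null
            Task obtainOffSwitchMap();  // off-rack map, else null
          }

          static List<Task> assignMaps(TaskSource source, int freeSlots) {
            List<Task> assigned = new ArrayList<Task>();
            boolean gaveOffSwitch = false;
            while (assigned.size() < freeSlots) {
              Task t = source.obtainLocalMap();
              if (t == null) {
                if (gaveOffSwitch) break;   // the one-off-switch quota is used up
                t = source.obtainOffSwitchMap();
                if (t == null) break;       // nothing runnable at all
                gaveOffSwitch = true;
              }
              assigned.add(t);
            }
            return assigned;
          }
        }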

        The problem seems to be that processing these heartbeats is too expensive? This would argue for a more complete JT redesign. Heartbeats would then be very cheap, since it would just be a state update on a few tasks and a dispatch of already queued work.

        Agreed, we would need to implement finer-grained locking in the JT to reduce the cost of heartbeat processing - currently each heartbeat locks up the entire JT (ala the BFLock in the Linux 2.2 kernel... smile). HADOOP-869 tracks this.

        We should probably start another thread on that, where the JT plans and queues tasks for task trackers in one thread.

        +1. I'll open another jira for global scheduling.

        Owen O'Malley added a comment -

        +1

        Arun C Murthy added a comment -

        Fixed some test cases which assumed only one task was being assigned per heartbeat. I've changed TestJobTrackerRestart.testRecoveryWithMultipleJobs to ignore the finish times, as they are unpredictable with the fast-start of tasks now.

        Devaraj Das added a comment - edited

        I guess I am too late in commenting on this, but one thing that might be worth doing is to ask for a fresh batch of tasks as soon as the TT empties its queue of tasks (TaskTracker.TaskLauncher.tasksToLaunch is the queue); see the sketch below. We would batch the request for new tasks on the basis that the TT has started running all the queued tasks and should now backfill the queue. That way, the TT would never really be idle. This might be important for GridMix-style applications where there are many small tasks...
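        A minimal sketch of that backfill idea, assuming a hypothetical requestMoreTasksFromJobTracker() hook (the real queue is TaskTracker.TaskLauncher.tasksToLaunch; the trigger shown here is invented):

        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;

        public class TaskLauncherSketch implements Runnable {

          private final BlockingQueue<Runnable> tasksToLaunch =
              new LinkedBlockingQueue<Runnable>();

          public void run() {
            try {
              while (!Thread.currentThread().isInterrupted()) {
                Runnable task = tasksToLaunch.take();  // blocks until a task arrives
                task.run();                            // launch it (vastly simplified)
                if (tasksToLaunch.isEmpty()) {
                  // Queue just drained: ask for a fresh batch right away instead
                  // of waiting for the next regular heartbeat, so the TT never idles.
                  requestMoreTasksFromJobTracker();
                }
              }
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
            }
          }

          private void requestMoreTasksFromJobTracker() {
            // Hypothetical hook: would trigger an out-of-band heartbeat asking
            // the JobTracker for another batch of tasks.
          }
        }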

        Matei Zaharia added a comment -

        I had one comment: is it possible to change the "findLocalTask" parameter in JobInProgress.findNewMapTask from a boolean to an int maxLevel for the maximum level in the topology where you can look? This would be useful for being able to search for data-local tasks before rack-local tasks. Data vs. rack locality shouldn't make much of a difference in most workloads, but I had some tests at Facebook where data-local tasks seemed to be faster, and this is a small change that will leave this possibility open.
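        A sketch of the proposed signature change, with the surrounding search logic invented purely for illustration (level 0 = node-local, 1 = rack-local, 2 = off-switch):

        public class FindTaskSketch {

          // Before (sketch): a boolean could only express "local" vs "anything":
          //   int findNewMapTask(String host, boolean findLocalTask) { ... }
          //
          // After (sketch): an int maxLevel bounds how far up the topology the
          // search may go, so callers can ask for node-local tasks first
          // (maxLevel 0), then widen to rack-local (1), then off-switch (2).
          int findNewMapTask(String host, int maxLevel) {
            for (int level = 0; level <= maxLevel; level++) {
              int tid = findTaskAtLevel(host, level);  // hypothetical helper
              if (tid >= 0) {
                return tid;
              }
            }
            return -1;  // nothing runnable within maxLevel of this host
          }

          private int findTaskAtLevel(String host, int level) {
            return -1;  // stub: real code would consult the job's per-level task caches
          }
        }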

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12395984/HADOOP-3136_4_20081212.patch
        against trunk revision 726129.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3738/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3738/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3738/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3738/console

        This message is automatically generated.

        Arun C Murthy added a comment -

        I need to check why TestJobInProgress hangs with this patch...

        Arun C Murthy added a comment -

        Fixed the problem with the test case.

        Arun C Murthy added a comment -

        All tests pass except TestJobTrackerRestart, which is tracked by HADOOP-4879. I've also opened HADOOP-4880 to track improvements to TestJobTrackerRestart, which tests features specific to the scheduler prior to this patch (i.e. it assumes a particular scheduling order which is no longer relevant).

        Arun C Murthy added a comment -

        I just committed this.


          People

          • Assignee: Arun C Murthy
          • Reporter: Devaraj Das
          • Votes: 0
          • Watchers: 19