This jira isn't very clear. What are you proposing to change? Is it to make mapred.tasktracker.tasks.maxmemory pluggable? If so, I'd propose an interface like:
abstract class MemoryPlugin { abstract long getVirtualMemorySize(Configuration conf); }
and you configure an implementation of it (mapred.server.memory.plugin?).
I'm proposing a couple of improvements:
Attaching a patch. This
Messed up the approach. Here's another patch that gets it right. It
The latest patch kills only the most recently started task if the sum total of all tasks' memory usage goes beyond the configured limit. Killing only one task may or may not bring usage back within the configured limit. Shouldn't we be picking enough tasks to kill?
Yes. Kill one or more so you go below the limit, as mentioned in the summary in
Attaching a new patch to address this. TaskMemoryManagerThread now calls TaskTracker.findTaskToKill() repeatedly to find enough tasks with the least progress that, once they are killed, the total memory usage of all tasks falls below the TT's limit, and then kills them. Changed the signature of TaskTracker.findTaskToKill() to TaskTracker.findTaskToKill(List<TaskAttemptID> tasksToExclude) so that tasks already marked for killing can be excluded.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12393629/HADOOP-4523-20081110.txt
against trunk revision 712615.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 Eclipse classpath. The patch retains Eclipse classpath integrity.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3569/testReport/
This message is automatically generated.
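The selection loop described in the patch comment (call findTaskToKill repeatedly, passing the tasks already chosen as an exclusion list, until total usage fits under the TT limit) can be sketched as below. This is a self-contained illustration, not the committed code: TaskInfo, selectTasksToKill, and the use of String attempt IDs instead of TaskAttemptID are hypothetical stand-ins, and memory figures are plain byte counts.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskKillSelector {

    /** Hypothetical stand-in for a running task attempt (not Hadoop's class). */
    public static final class TaskInfo {
        final String attemptId;
        final double progress;    // 0.0 to 1.0
        final long memoryBytes;   // current memory usage

        public TaskInfo(String attemptId, double progress, long memoryBytes) {
            this.attemptId = attemptId;
            this.progress = progress;
            this.memoryBytes = memoryBytes;
        }
    }

    /** Returns the least-progress task not in tasksToExclude, or null if none remain. */
    public static TaskInfo findTaskToKill(List<TaskInfo> running, List<String> tasksToExclude) {
        TaskInfo best = null;
        for (TaskInfo t : running) {
            if (tasksToExclude.contains(t.attemptId)) {
                continue; // already marked for killing
            }
            if (best == null || t.progress < best.progress) {
                best = t;
            }
        }
        return best;
    }

    /** Picks tasks until the projected total usage is at or below limitBytes. */
    public static List<String> selectTasksToKill(List<TaskInfo> running, long limitBytes) {
        long total = 0;
        for (TaskInfo t : running) {
            total += t.memoryBytes;
        }
        List<String> toKill = new ArrayList<>();
        while (total > limitBytes) {
            TaskInfo victim = findTaskToKill(running, toKill);
            if (victim == null) {
                break; // every task is already marked
            }
            toKill.add(victim.attemptId);
            total -= victim.memoryBytes;
        }
        return toKill;
    }

    public static void main(String[] args) {
        List<TaskInfo> running = List.of(
                new TaskInfo("attempt_1", 0.9, 400),
                new TaskInfo("attempt_2", 0.1, 300),
                new TaskInfo("attempt_3", 0.5, 500));
        // Total usage is 1200 against a limit of 700.
        System.out.println(selectTasksToKill(running, 700)); // [attempt_2, attempt_3]
    }
}
```

Passing the growing kill list back in as the exclusion list is what keeps the loop from picking the same task twice, which is the reason given above for changing the findTaskToKill signature.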
Attaching another patch with the above review comments.
Didn't write a separate testMixedSetExceedingLimits; it didn't seem to add any value, since that case is already covered indirectly by the other two independent tests, which verify the tasks' limits and the TT limits.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12393859/HADOOP-4523-20081113.txt
against trunk revision 713893.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 Eclipse classpath. The patch retains Eclipse classpath integrity.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3590/testReport/
This message is automatically generated.
Code looks good to me. +1
Vinod, while going over this patch with Devaraj for a quick check, we thought it would be nice to split the run method in TaskMemoryManagerThread into a couple of smaller methods, just to ease readability. The rest of the changes are still fine. Can you please submit a new patch with this minor change?
`ant test-patch` results:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
Going to run it through hudson once more.
I just committed this. Thanks, Vinod!
Integrated in Hadoop-trunk #665 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/665/)
. Prevent too many tasks scheduled on a node from bringing it down by monitoring for cumulative memory usage across tasks. Contributed by Vinod Kumar Vavilapalli
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394152/HADOOP-4523-20081118.txt
against trunk revision 719431.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3618/console
This message is automatically generated.
HADOOP-3759 provides a configuration value, mapred.tasktracker.tasks.maxmemory, which specifies the total VM on a machine available to tasks spawned by the TT. Along with HADOOP-4439, it provides a cluster-wide default for the maximum VM per task, mapred.task.default.maxmemory. This value can be overridden by individual jobs. HADOOP-3581 implements a monitoring mechanism that kills tasks if they go over their maxmemory value. Keeping all this in mind, here's a proposal for what we additionally need to do:
If tasks.maxmemory is set, the TT monitors the total memory usage of all tasks spawned by the TT. If this value goes over tasks.maxmemory, the TT needs to kill one or more tasks. It first looks for tasks whose individual memory usage is over their default.maxmemory value. These are all killed (ideally you would kill just enough of them to bring total usage back down, but it's not obvious which of these violators to choose, so it's probably simpler to kill all of them). If no such task is found, or if total usage is still over the limit after killing them, we need to pick other tasks to kill. There are many ways to do this; probably the easiest is to kill the tasks that started most recently.
Tasks that are killed because they went over their memory limit should be treated as failed, since they violated their contract. Tasks that are killed because the sum total of memory usage was over a limit should be treated as killed, since it's not really their fault.
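The two-phase policy proposed above (stop every task over its own limit and count it as failed, then, if the TT-wide total is still too high, stop the most recently started tasks and count them as killed) might look roughly like this. It is a sketch under stated assumptions: Task, Outcome, and decide are hypothetical names, each task's limit is assumed to be the job's value falling back to mapred.task.default.maxmemory, and "most recent" is taken from a launch timestamp.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CumulativeMemoryPolicy {

    /** FAILED: the task broke its own limit. KILLED: it was shed to protect the node. */
    public enum Outcome { FAILED, KILLED }

    /** Hypothetical snapshot of a running task. */
    public static final class Task {
        final String id;
        final long startTime;   // launch time; larger means started more recently
        final long usedBytes;   // current memory usage
        final long limitBytes;  // assumed per-task limit (job value or default.maxmemory)

        public Task(String id, long startTime, long usedBytes, long limitBytes) {
            this.id = id;
            this.startTime = startTime;
            this.usedBytes = usedBytes;
            this.limitBytes = limitBytes;
        }
    }

    /** Decides which tasks to stop when total usage exceeds ttLimitBytes. */
    public static Map<String, Outcome> decide(List<Task> tasks, long ttLimitBytes) {
        Map<String, Outcome> decisions = new LinkedHashMap<>();
        long total = 0;
        for (Task t : tasks) {
            total += t.usedBytes;
        }
        if (total <= ttLimitBytes) {
            return decisions; // under the TT-wide limit: nothing to do
        }
        // Phase 1: every task over its own limit is stopped and counted as FAILED.
        for (Task t : tasks) {
            if (t.usedBytes > t.limitBytes) {
                decisions.put(t.id, Outcome.FAILED);
                total -= t.usedBytes;
            }
        }
        // Phase 2: if still over, stop the most recently started tasks as KILLED.
        List<Task> remaining = new ArrayList<>();
        for (Task t : tasks) {
            if (!decisions.containsKey(t.id)) {
                remaining.add(t);
            }
        }
        remaining.sort(Comparator.comparingLong((Task t) -> t.startTime).reversed());
        for (Task t : remaining) {
            if (total <= ttLimitBytes) {
                break;
            }
            decisions.put(t.id, Outcome.KILLED);
            total -= t.usedBytes;
        }
        return decisions;
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(
                new Task("a", 1, 300, 500),
                new Task("b", 2, 600, 500),  // over its own limit
                new Task("c", 3, 400, 500),
                new Task("d", 4, 200, 500));
        // Total 1500 vs. TT limit 800: b fails (900 left), then d is killed (700 left).
        System.out.println(decide(tasks, 800)); // {b=FAILED, d=KILLED}
    }
}
```

Keeping FAILED and KILLED distinct in the result is what lets the caller report the two cases differently, as suggested above: only the tasks that violated their own contract count against the job.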
Another improvement is to let mapred.tasktracker.tasks.maxmemory be set by an external script, which lets Ops control what this value should be. A slightly less desirable option, as indicated in some offline discussions with Allen W, is to express this value either as an absolute number ("hadoop may use X amount") or as an offset from the total amount of memory on the machine ("hadoop may use all but 4g").
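The two ways of expressing the limit could sit behind a small pluggable abstraction, loosely in the spirit of the MemoryPlugin interface floated earlier in the thread. Everything here is hypothetical: MemoryLimitPlugin, AbsoluteLimit, ReservedOffsetLimit, and getTasksMaxMemory are illustrative names, and how the TT would learn the machine's total memory is left out.

```java
public class TaskMemoryLimits {

    /** Hypothetical plugin: computes the total bytes tasks may use on this node. */
    public abstract static class MemoryLimitPlugin {
        public abstract long getTasksMaxMemory(long totalMachineBytes);
    }

    /** "hadoop may use X amount": a fixed value, independent of machine size. */
    public static final class AbsoluteLimit extends MemoryLimitPlugin {
        private final long limitBytes;

        public AbsoluteLimit(long limitBytes) {
            this.limitBytes = limitBytes;
        }

        @Override
        public long getTasksMaxMemory(long totalMachineBytes) {
            return limitBytes;
        }
    }

    /** "hadoop may use all but Y": total memory minus a reserved amount, never negative. */
    public static final class ReservedOffsetLimit extends MemoryLimitPlugin {
        private final long reservedBytes;

        public ReservedOffsetLimit(long reservedBytes) {
            this.reservedBytes = reservedBytes;
        }

        @Override
        public long getTasksMaxMemory(long totalMachineBytes) {
            return Math.max(0L, totalMachineBytes - reservedBytes);
        }
    }

    public static void main(String[] args) {
        final long gb = 1L << 30;
        long total = 16 * gb;
        System.out.println(new AbsoluteLimit(10 * gb).getTasksMaxMemory(total) / gb);     // 10
        System.out.println(new ReservedOffsetLimit(4 * gb).getTasksMaxMemory(total) / gb); // 12
    }
}
```

The offset form adapts automatically to machines of different sizes, while the absolute form does not; an external-script implementation would be a third subclass of the same abstraction.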