I've read through this jira a few times, and looked at some of the previously mentioned jiras around memory limits. I think I see where the issue actually lies.
Arun started to fill in the historical background, but I think he may have missed a significant point. Let's retell the story, so that we can get to the crux of the ops requirement here....
Under HOD w/torque, we configured torque such that it would limit the virtual memory size to total vm - 4GB. [This left plenty of RAM for Linux, our monitoring software, etc. So on a machine with 4x4GB swap partitions and 16GB RAM, the vm limit would be set to 28GB.] Now the thing about HOD is that it allocates the entire node to a single job... which means there is a subtle point here, easily missed: the vm limit under torque was the aggregate for all of the tasks on the node, not just a single task. So if you had a badly behaving task/job, it would kill all the tasks running on that node.
To simulate this ops requirement, hadoop should be summing the memory used by all the tasks on a node and then performing some action on that total. While I realize there is a desire to only punish 'bad tasks', I'm not sure there is an easy way to do that. Putting my jack boots on, my answer is Kill Them All and Let The Users Sort Themselves Out. If I have to pick between killing the system (we're talking hard hang here, not a happy little panic, in my experience) and punishing potentially innocent users, the answer is easy.
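Just to make the torque-era behavior concrete, here's a minimal sketch of what I mean. All the names here are illustrative, not actual Hadoop or torque APIs; the only facts taken from the story above are the 4GB headroom and the kill-everything-on-the-node policy.

```python
# Hypothetical sketch of the torque-style aggregate limit: compute the
# node's vm limit (total vm minus fixed headroom), sum vm usage across
# all tasks on the node, and if the aggregate exceeds the limit, kill
# every task on that node.

HEADROOM_BYTES = 4 * 1024**3  # reserve 4GB for Linux, monitoring, etc.

def node_vm_limit(ram_bytes, swap_bytes):
    """Aggregate vm limit for the node: total vm minus fixed headroom."""
    return ram_bytes + swap_bytes - HEADROOM_BYTES

def tasks_to_kill(task_vm_usage, limit_bytes):
    """The 'kill them all' policy: if the aggregate vm of all tasks on
    the node exceeds the limit, every task id is returned; otherwise
    none are."""
    if sum(task_vm_usage.values()) > limit_bytes:
        return list(task_vm_usage)  # punish everyone, innocent or not
    return []

# The example from the text: 16GB RAM + 4x4GB swap -> a 28GB limit.
limit = node_vm_limit(16 * 1024**3, 4 * 4 * 1024**3)
```

Note there's no attempt to find the "bad" task: once the node crosses the line, everything goes, which is exactly the behavior torque gave us.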
Now here is where things get more complex, and there is a very good chance I've gotten this wrong. [Hopefully I have, because it sounds to me like a feature was added in the wrong spot.]
It sounds like the capacity scheduler has the ability to kill tasks based upon vm usage per node. It has this idea of a max vm size and how much memory each task is asking for, and it then schedules based upon a weird slot+mem ratio.
While this is a fine and dandy feature that would likely fix the requestor's problem, I think it is a bit short-sighted not to have the kill feature at the task tracker level. The task tracker, regardless of scheduler, should still be able to keep track of all the memory used on the box and kill as necessary. If a scheduler wants to provide alternative logic, more power to it. But tying this to a scheduler just seems a bit ridiculous.
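The separation I'm arguing for could look something like this sketch: the task tracker owns the per-node accounting, and the policy is just a function a scheduler may swap out. Again, these class and function names are hypothetical, not the real TaskTracker API.

```python
# Sketch: memory enforcement lives in the task tracker, independent of
# any scheduler. The default policy is the aggregate kill-them-all rule;
# a scheduler that wants smarter logic can plug in its own function.

def kill_all_policy(task_vm_usage, limit_bytes):
    """Default policy: if the node's aggregate vm exceeds the limit,
    every task on the node gets killed."""
    if sum(task_vm_usage.values()) > limit_bytes:
        return list(task_vm_usage)
    return []

class TaskTrackerMemoryMonitor:
    """Tracks per-task vm usage on one node and decides what to kill."""

    def __init__(self, limit_bytes, policy=kill_all_policy):
        self.limit_bytes = limit_bytes
        self.usage = {}       # task id -> last reported vm bytes
        self.policy = policy  # schedulers may provide alternative logic

    def report(self, task_id, vm_bytes):
        """Record the latest vm usage for one task on this node."""
        self.usage[task_id] = vm_bytes

    def check(self):
        """One monitoring pass: ask the policy which tasks to kill,
        forget them, and return their ids to the caller."""
        doomed = self.policy(self.usage, self.limit_bytes)
        for task_id in doomed:
            del self.usage[task_id]
        return doomed
```

The point of the `policy` parameter is exactly the argument above: the tracker always enforces something sane on its own box, and a scheduler can override the logic without the kill mechanism itself being tied to any one scheduler.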