It looks like mapred.jobtracker.maxtasks.per.job is a global value. Would be interesting to have each job optionally override this value for itself.
This jira is probably asking for a feature that allows limiting the number of concurrently running tasks of a job. A job could have a million tasks, but if we can somehow tell the system to not schedule more than 100 tasks of this job concurrently, then your problem is solved, isn't it?
Yes, a limit on the number of concurrently running tasks, cluster wide, specified by and enforced against a given job would be great.
But having the option to set the number of concurrently running tasks, node wide, for a given job is also important. Nathan thinks this might help with his issue on HADOOP-5160 as well. So for a job with a million tasks, only 100 can run concurrently in the cluster, and only 1 can run per node. +1 on Chris' thoughts above.
Ideally we'd like to have as much control as possible (per job, max concurrent tasks across cluster and/or per node). Either of these would satisfy my requirements, so if one fits more easily into the existing scheduler/jobtracker, I think we should go after that approach. The approach of Vinod does not help in my base use case because I want to run avg 1 cpu-bound task per node (cluster max = nodes) but my network latency bound jobs I'd like to run 5 or more per node (cluster max = 5 * nodes). We are using threading in some places but it's significant complexity in many already complex MapReduce jobs. I'm also interested in this - we want to import from a database using map-reduce, but it'd be nice to tune on a per-job basis.
My use case is a job with a custom output format which utilizes local disk heavily. Jobs that should take half an hour with one task per node are taking 4 hours due to disk contention. There's no feature I want more in Hadoop than this one.
Here is a start at a patch for this issue. I added limits on running maps and reduces in the form of four parameters:
All the limits start at infinity by default (meaning no limit other than the number of slots on the node, as happens today). These limits are located in JobInProgress and affect whether obtainNewMapTask and obtainNewReduceTask succeed. They will therefore work with any job scheduler (default FIFO scheduler, fair scheduler or capacity scheduler). For example, setting the per-cluster limit for a job under the FIFO scheduler will mean that this job will consume a certain number of slots (even if it has more tasks than this number of slots), and the other slots can be used by later jobs in the queue. Let me know whether this approach looks good and whether the names for the parameters make sense. I can then maybe move the parameter strings into JobConf methods so they don't appear right in JobInProgress. I'm not intimately familiar with this area of the codebase, but I filed the issue.
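To make the mechanism above concrete, here is a minimal sketch of how such limits could gate task assignment. It is not the patch itself: the class and method names (`JobLimits`, `JobState`, `obtain_new_task`) are hypothetical stand-ins for the checks described in JobInProgress, and it is written in Python only for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class JobLimits:
    """Hypothetical per-job limits; infinity means "no limit" (the default)."""
    max_running: float = math.inf    # cap on running tasks across the cluster
    max_per_node: float = math.inf   # cap on running tasks on any one node


@dataclass
class JobState:
    limits: JobLimits
    pending: int
    running_per_node: Dict[str, int] = field(default_factory=dict)

    def running_total(self) -> int:
        return sum(self.running_per_node.values())

    def obtain_new_task(self, node: str) -> Optional[str]:
        """Return a task id for `node`, or None if nothing may be launched.

        Returning None is already part of the scheduler contract, so any
        scheduler that handles "no task right now" also handles a limit.
        """
        if self.pending == 0:
            return None
        if self.running_total() >= self.limits.max_running:
            return None  # cluster-wide limit reached
        if self.running_per_node.get(node, 0) >= self.limits.max_per_node:
            return None  # per-node limit reached
        self.pending -= 1
        self.running_per_node[node] = self.running_per_node.get(node, 0) + 1
        return f"task-on-{node}-{self.running_per_node[node]}"
```

With `JobLimits(max_running=3, max_per_node=1)`, a second request from the same node is refused even if that node has free slots, and a fourth node is refused once three tasks are running; the refused slots stay available to later jobs in the queue.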
The approach seems simple and straightforward to me, +1. Patch looks good. I had the impression this would be a bit harder to do than that. These four settings should fulfill all the different use cases I've heard for this issue. I will test the latest patch on Monday. Thanks Matei!
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12407054/tasklimits.patch against trunk revision 770685. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/282/testReport/ This message is automatically generated. Matei, wanted to understand this a bit more..
Another use case that I heard of is the following: If I have a CPU bound job, I may want to restrict the number of tasks that run on a node, so I can use the cores all for myself. So, if I set mapred.max.maps.per.node for this job appropriately, the patch makes sure that only that many tasks of this job are scheduled on the node, even if slots are free. However, another job that has no limits set can get scheduled on the free slots. Then exclusivity is not achieved, right? So is the above use case not handled by this patch, or can it still be accomplished in some way?
Hemanth, if you submit another job that is also CPU-bound, it may interfere with the first. However, if you submit one that is IO-bound, it will be fine. This task limit feature isn't meant to solve the general resource allocation problem, only to give you a way to limit resource consumption if you know that you have one job with very resource-intensive tasks and many jobs with less resource-intensive tasks. Because it's such a simple feature, I think it's a good one to add before building any kind of automatic resource-aware scheduling. It will solve many users' problems in the short term, as evidenced by the number of votes and watchers.
"It will solve many users' problems in the short term, as evidenced by the number of votes and watchers."
Indeed, I know about many shops that use their cluster for one job at a time and that need that kind of sweet feature. Here's a new patch with a few changes suggested by Tom White:
I also renamed the "per cluster" limit parameters to mapred.running.map.limit and mapred.running.reduce.limit, which I think are clearer names (you only have one cluster, so per cluster sounds strange). +1
The documentation changes will go to http://hadoop.apache.org/core/docs/current/mapred-default.html Regarding unit testing, can you unit test JobInProgress's obtainNewMapTask() and obtainNewReduceTask() directly? -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408326/tasklimits-v2.patch against trunk revision 776904. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 492 release audit warnings (more than the trunk's current 491 warnings). +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/367/testReport/ This message is automatically generated. Here's a new patch with unit tests. I used MiniMRCluster because it's hard to test obtainNewMapTask/ReduceTask directly without duplicating a lot of the code in MiniMRCluster (setting up a JobTracker and a TaskTracker).
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408747/tasklimits-v3.patch against trunk revision 777761. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 492 release audit warnings (more than the trunk's current 491 warnings). -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/392/testReport/ This message is automatically generated. I'm still on 0.19 and really need this patch.
Attached is a patch that applies cleanly to current 0.19 branch. Note: This removes the unit test! I tried to get the test to work in 0.19 but eventually turned to doing manual testing on my cluster with my jobs. All my testing shows that the patch works as advertised. I'm moving this into a busy production system and will report back if there are any issues. It looks like this broke the TestJobHistory test. I checked out trunk and it passed, then applied this patch and it failed.
Sorry, false alarm. Looks like TestJobHistory is currently flaky (
Attached is a backport of this patch against 0.18.3 including the new unit tests, which pass. There are some backports here of new test code in UtilsForTests, etc, required for the tests to run.
I have a minor nit - the code in JobInProgress.findNewMapTask/findNewReduceTask that this patch adds is very similar and probably can be factored out to a separate method with the appropriate args.
Other than that, in the testcase, there are big waits (and the testcase takes ~3 minutes to run). Do they need to be so long? Also, in general, we should move to the model of spoofing heartbeats (and faking other objects) in such testcases but I won't hold this patch up for that (unless there is enthusiasm to modify the test in that direction).
I had to make the waits big because the inter-heartbeat interval is big. I can remove some of the test cases though (having just the first 2 might be enough). I'd really rather not dive into modifying test harnesses as part of this patch, but I think that would be a great feature to add to MiniMRCluster in a separate JIRA.
Here's a new patch that refactors the limit checking logic into a single method. I've also removed the test for limits being larger than slot counts because that is tested in other mapred tests (where we do not have any task limits set).
Also, I haven't been able to figure out what the release audit warning is (if it was valid) because the link posted by Hudson doesn't work. Let's see if it comes up again. I don't see any Java warnings generated by my patch so I'm not sure what it could be.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12409578/tasklimits-v4.patch against trunk revision 781115. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 493 release audit warnings (more than the trunk's current 492 warnings). +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/451/testReport/ This message is automatically generated. The release audit says the following:
437d436
< [java] !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-781115_HADOOP-5170_PATCH-12409578/src/mapred/mapred-default.xml.orig
What does this mean? Is it just complaining that I modified mapred-default.xml? The failing contrib test seems to be unrelated ( I just committed this. Thanks, Matei!
This cannot be committed to the older branches since it's a new feature (as opposed to a regression). For those interested in using this in older branches, Todd and Jonathan's patches should work. Jonathan's 0.19 patch should also work in the 0.20 branch, I believe.
patch for yahoo internal repo
I'm sorry I'm coming in late on this. Are these new knobs at the job level? I think this is the wrong direction. In particular, just limiting the number of slots won't do any good. The high ram job processing is a much better model. So rather than declaring max number of maps or reduces, we should allow "large task" jobs where each task is given multiple slots. I think we should revert this before it is released.
Ditto.
Couldn't agree more. See I see the following issues: mapred.max.{maps|reduces}.per.node Given that we mostly agree that the high-ram jobs are the right model, these features interact very badly with each other. Consider a high-ram job which has mapred.max.maps.per.cluster set to 100 and a few thousand tasks (a fraction of which is sufficient to exhaust the capacity of its queue). Now we have 2 choices if we choose to incorporate mapred.max.{maps|reduces}.per.cluster:
Either way we are in trouble.
Unfortunately, yes. +1 I wrote this patch because of demand for this feature from many users, including Cloudera customers, who had reasons to limit the number of tasks per job. If the patch interacts poorly with the capacity scheduler, wouldn't it make more sense to tell users of the capacity scheduler not to use this feature, or to improve
I agree that task limits are not the best solution for dealing with high-memory jobs, but that's not the only situation that this patch addresses. Here are some others:
I also think the problems Arun brought up in the capacity scheduler aren't insurmountable. To deal with the locality problem, you can give up slots when you run out of local data on them, like More importantly, I think Arun's issues will come up even if task limits aren't used. For example, suppose that a given queue's capacity is 25% of the cluster, say 250 slots, and that all the other queues are filled with jobs, so there is never excess capacity. Then if a high-memory job is submitted to the queue with 25% capacity, it has to behave exactly as if it has a limit of 250 tasks cluster-wide. The same problems that Arun brought up will happen: locality will be poor, the job might make bad bets, etc. This situation will be far more common than a user explicitly setting a limit on running tasks for the job, so if these two concerns are serious, maybe more work needs to be put into I am using this patch on 0.19 and 0.20, and will have to continue patching it if taken out of the 0.21 release. I've also recommended this patch and know of others who are using it... It solves a number of scheduling issues we have.
The idea of having "heavy" tasks works great for high-memory jobs. But we have a number of network-io bound jobs and cpu bound jobs. We can make the network bound jobs light, and cpu bound jobs heavy, but what we really want is to have one cpu bound job run per node (we have one core for it), and lots of network jobs per node... simultaneously. With task weights, how can I prevent the cpu bound job from taking up all the slots and letting in high numbers of network bound jobs at the same time?
Like I said, this patch introduced 2 features: per-node limits and per-job limits. The per-job cluster limit interacts poorly with the
We are missing the woods for the trees here. The user whose task is CPU-intensive is happy. But, what about the other users whose tasks need CPU too? How do we keep their tasks from starving on the same node? In particular there are no checks and balances on preventing multiple CPU-intensive tasks from being scheduled on the same node. If we knew that the other tasks were going to be IO intensive we could co-schedule them on this node, but we don't. This is the reason why Owen and I continue to insist that per-node task limits are a poor substitute for modelling resource usage and that a resource model is a pre-requisite for this feature. This feature works well for clusters with single users, but not in shared clusters.
The short-term fix is to submit these jobs to a special queue with limited capacity, possibly to queues with a hard upper-limit on their capacity: If the capacity scheduler will need to support cluster-wide task limits through
I fully agree that per-node limits aren't a solution to resource sharing in a multi-user environment. However, most Hadoop clusters outside Yahoo, Facebook and a few other large installations are essentially single-user. For these clusters, the per-node limit solves real problems until the time when we introduce a resource model to Hadoop (which will be a serious undertaking). This is why there were many votes on this feature. Whenever I've talked to people about the fair scheduler, task limits have been one of the most requested features. This is an optional feature - users don't have to use it and it is turned off by default. It has no performance impact for those who don't enable it. On the other hand there are a substantial number of users who are using it (or interested in using it), and would be left with no immediate alternative if it were pulled from the next release.
We can leave this feature in until there is a suitable replacement, after which time it can be deprecated so users can migrate to the replacement. Could that work? The problem with this patch is that it doesn't do anything useful and gets in the way of real fixes to the problem. Limiting the number of running tasks per a job is not what any users need. It is being used as an expedient way to hack in global resource limits. If I launch two cpu hungry jobs (even with a single user!), I will kill those nodes. That is not ok. While
Clearly in the long term, I think we need a more dynamic model that tracks the resources being consumed and launches new tasks appropriately. In the mean time,
-1 Because once it is in, we need to support it. When it doesn't work for the users, they'll try and fix it. We already have users confused by too many knobs... This is not resolved and will likely be reverted.
Sorry, I should have clarified that we are simplifying the scope of Thanks for the clarification Arun. Are you targeting
Yes, it should be ready in a day or two. We will release a patch for the yahoo 0.20 branch too.
@Arun: Is the Yahoo 0.20 branch (that you refer to) an existing branch in Apache svn? How soon can I access it?
Arun, can you explain what the hard limit on capacity means? Is it option 2 in
Owen, I agree that having This might also be a good time to figure out exactly what scheduling functionality can go into JobInProgress/TaskTracker/etc and what should go into schedulers. I didn't think the limits in this patch added much complexity. They obey the contract of obtainNewMapTask, which is that it may or may not return a task. Schedulers already have to deal with jobs that have no tasks to launch on one heartbeat, and then have tasks on the next, because of speculation. So any scheduler that works with that should more or less be okay if the job chooses to launch a task based on its total number of running tasks. If we agreed on some kind of contract, then we would be able to implement common scheduling functionality in the mapreduce package rather than having it be contrib. Otherwise, as long as there are multiple groups working on scheduling on Hadoop, everyone will be worried that someone else's change will break their future work.
Yes, it will only add a hard limit per queue. This feature is aimed at solving a certain class of problem alone, i.e. limiting fan-in for a specific resource.
I think those limits are a very bad idea. If you think they are useful, I'd be ok with putting them into the fair share scheduler. Then you can experiment with them and see if they are useful outside of a single user research cluster. I suspect you'll quickly discover the lack of utility in this patch. I certainly don't think this is appropriate to go into the main map/reduce framework.
You're right that we can't support Jonathan's use case, but it isn't a valid use case for Hadoop. Hadoop is about sharing a cluster with other users. It is not a personal supercomputer operating system. (Although Arun and I did have a personal Hadoop supercomputer for a couple months while running the sort benchmarks. :) ) I plan on reverting this and closing this as won't fix.
I don't see how my use case is invalid.
And if decisions are always made for the betterment of very large clusters with multiple users rather than smaller clusters with a single user (or at least, in a controlled environment), you're going to alienate a vast majority of new users and even seasoned users who happen to not care about multi-user environments. I don't particularly care how I get done what I need to get done. But I have thousands of jobs submitted to my cluster every day and prior to this patch I was unable to get to 25% of the capacity I have now because of the characteristics of a very small minority of the total number of jobs. I need a way to prevent CPU intensive jobs from eating all the cores. If I have 20 nodes and 100 CPU bound jobs, I only want 1 per node at a time. That strikes me as a common use case, not an invalid one. At the same time I'd like a way to allow 10 jobs per node of network bound jobs that do little else but wait around. From a boots on the ground perspective, I can say that this jira comes up more than once a week in discussion with new users, etc. It's easy to understand and gives the user an enormous amount of control with very few knobs. At least stick this functionality into a corner somewhere so those of us who want it can still use it. When there is more mature scheduling down the road, I'm more than happy to switch.
I beg to differ. Many of my users run many concurrent single-purpose Hadoop clusters in AWS, each tuned and sized to the particular load. I believe HOD exists for this purpose as well, to some extent. Re this patch: many users, I expect the bulk of Hadoop users, have small constrained clusters and are happy with hand crafting their workloads to get the best utilization and performance. Whether or not this patch is 'correct', reading above I get the impression it is being used and the users find it useful. Flat out reverting it seems heavy handed without an alternative available.
Jonathan, can you please sketch the scenario you face in more detail? I'll hazard a guess and ask you to consider using high-ram jobs for your CPU intensive jobs and use For e.g. create a 'cpu-heavy' queue and limit its capacity as a fraction of your total capacity using Yes, you will not be able to run too many other tasks on those nodes. But with Will that work?
Jonathan, would you be okay if this feature was placed in the fair scheduler?
I agree with Owen that it's a good idea to remove task limits from the Hadoop core interface so as not to lock in the API. However, I will support them in the fair scheduler. Apart from the demand for this feature from the community, we have paying customers at Cloudera who asked for this patch and run it on multi-rack clusters. @Matei +1
@Arun I need to dig further into
Please don't assume everyone uses Hadoop the same way. Many users are more than willing to tune a specific job flow on a cluster for max compute power per $ and have little or no ad-hoc job submission. For some such a workflow is business critical and the cost/benefit of heavy hand-tuning is obvious. Yes, these tuning knobs are not good for all use cases and especially ad-hoc shared cluster use. Being able to specifically tune a business critical workflow is a very good thing for some. There is a reason this JIRA gained a lot of watchers fast and has the most votes for it. A resource model will be very useful, but it will never beat hand tuning for performance – it will only get part way there – and the two can be cooperative as well. Perhaps in the long term there could be a shared scheduler component that contains all the task limits with a default implementation, and each scheduler can optionally use, or override it. I think we need to take a step back.
The contention is that the specific knobs in question have long-term ramifications. In particular, putting these in the framework proper, as opposed to specific schedulers, imposes constraints on all schedulers regardless. We are happy to have specific schedulers targeted for specific use-cases/workloads support them, but we are opposed to allowing features which jeopardize one over the other, especially in the rather obvious way we've highlighted above. We have made in the past, and possibly will make in the future, some choices which hurt one use case over the other; in which case we will almost always be biased towards large, multi-user clusters. Having said that, there are some workarounds even with the CapacityScheduler (https://issues.apache.org/jira/browse/HADOOP-5170?focusedCommentId=12726751&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12726751).
I've opened MAPREDUCE-704 to add per-node task limits in the fair scheduler. In addition, MAPREDUCE-698 is for per-pool task limits. Are people alright with having per-pool limits instead of per-job limits? Pools allow you to group multiple jobs under the same limit.
I can just put all my jobs into a single pool, and have the same functionality I would have now, correct?
You'd be able to place a limit on the whole pool - for example, 100 maps. If you submit one job at a time to the pool, then that job will get the whole limit. However, if you submit two jobs, they will share it, and they will get 100 maps in total, not 100 each. Right now the jobs will do fair sharing, meaning that each job will get 50 maps. However, FIFO scheduling within a pool will also be supported by the fair scheduler in the future (I have a fairly well-tested patch for it that I am porting to trunk).
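As a rough illustration of the sharing behaviour described above, the sketch below splits a pool-wide cap among the jobs in the pool. The function name `share_pool_cap` and the simple max-min splitting rule are assumptions made for illustration; the fair scheduler's actual algorithm is not shown in this thread.

```python
from typing import Dict


def share_pool_cap(cap: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split a pool-wide cap among jobs evenly, never exceeding a job's demand.

    `demands` maps a (hypothetical) job name to the number of tasks it
    could run right now. Capacity a small job cannot use is passed on
    to the remaining jobs.
    """
    alloc = {job: 0 for job in demands}
    remaining = cap
    active = [job for job, demand in demands.items() if demand > 0]
    while remaining > 0 and active:
        # Equal share of what is left, at least one slot per round.
        share = max(remaining // len(active), 1)
        for job in list(active):
            if remaining == 0:
                break
            give = min(share, demands[job] - alloc[job], remaining)
            alloc[job] += give
            remaining -= give
            if alloc[job] == demands[job]:
                active.remove(job)  # job is fully satisfied
    return alloc
```

Under this sketch, a single job with 1000 runnable maps in a pool capped at 100 receives all 100, while two such jobs receive 50 each, matching the example in the comment above.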
So pooling is just a one-time thing when I submit the job? It's not something that persists and I submit things into?
I'm a big consumer of MR but have been on a need-to-know basis with respect to these things. I guess I now need to know. Again, part of what I liked about this issue/solution was that it's powerful, accessible, and easy to understand. I understand the concerns of larger users and the need to support this... And I would again ask if we could stick it into a corner somewhere so that it's still easy to access but does not get in the way of everything else. Otherwise, what I'd be interested in is an explanation / example of how users of this patch might accomplish the same types of things. For example, only allowing a particular job to use one task per node (or even total tasks at a time = total nodes). And at the same time, having other jobs that I allow 10s of tasks per node. I'm not following how that would work.
No, pools are persistent. You submit a job to a particular pool by setting a jobconf property (e.g. set pool.name="my_pool"). Then you'll be able to have caps on total maps or total reduces running for each pool. For example, you could limit your DB import pool to 20 mappers, and then all DB import jobs together will get no more than 20 mappers.
I was planning to make the per-node limits be on a per job basis as in the current patch. However, for the per-job limits, it seemed to make more sense to let them apply across multiple jobs by placing them on a pool. So in other words, to answer your second question, limiting one job to one task/node will be done as in the current patch (with mapred.max.maps.per.node).
Thought about it and I buy the argument that these knobs should not have been added in the core framework. So +1 for reverting this patch.
This is the patch to rollback the change.
I reverted this. Matei may move this into the Fair Share Scheduler.
Integrated in Hadoop-Common-trunk #22 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/22/
. Removed change log entry because was reverted from mapreduce. Integrated in Hadoop-Mapreduce-trunk #20 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/20/
Editorial pass over all release notes prior to publication of 0.21.
Robert, this patch has been reverted, so don't include the release note.
HADOOP-4018), but it should serve your purpose too. As for limiting the number of tasks of a job on each node, the same was being talked about at
HADOOP-4295, you may want to see the discussion there.