|
One obvious impact is on the scheduling of tasks. If a memory intensive job happens to be the current job being considered, what would happen if a TT comes in with lesser than required amount of memory. Scheduling tasks of other jobs to this TT could lead to starvation of the memory intensive job. Not scheduling other jobs could lead to under utilization of the cluster. However with different fairness scheduling primitives (like user limits, etc) which are being discussed in
Some more details on implementation:
Comments ? Initial patch for review and to encourage discussion
The patch incorporates the approach mentioned in the earlier comments. Specifically:
With these changes in place, schedulers such as Please provide feedback on this implementation. Given the proposals in this Jira, and in
The goal is to allow memory intensive jobs to run without affecting other jobs and also detecting/killing which jobs are violating their memory contract with Hadoop. Here is how we propose to do this:
After some internal discussions, we are proposing the following behavior with respect to default values of the configuration options MAX_MEM and MAX_MEM_PER_TASK. In the description below, MAX_MEM is specified for a tasktracker, and MAX_MEM_PER_TASK is specified in a job's configuration.
As the above choice means that the RAM intensive jobs can prevent other jobs from running, there must be some way to counter-balance this effect. One way could be for schedulers like Comments ? I'm in favor of the approach Hemanth has described above because it lets us reason about memory constraints in terms of slots/containers. In his example, each slot on the TT has a memory limit of 4GB. This also means that a task is guaranteed up to 4GB, if it runs in that slot. So if a task needs 6GB, it ends up using 2 slots. This helps schedulers, because they're really dealing with slots. For example, if they're keeping track of how many slots are used by a user or job, then assigning two slots to the task that wants 6GB works really well.
Another good thing about this approach is that tasks are not affected by other tasks, in terms of memory guarantees. Suppose task1 wanted 7.5 GB. This would leave 0.5GB for task2, which is not right, as task2 shouldn't suffer from task1's needs. It's better task2 not run in this case. The attached file implements the proposal mentioned above.
Following is a summary of the changes:
The patch contains some additional log statements that I will remove after the review is completed. Also, it is missing unit tests. Request a review of the code, except for these points. Regarding tests, I've tested the changes manually. I am looking for some ideas on how to automate these. What would be ideal is to test the following: Configure the memory related variables, schedule tasks in a predetermined order, verify that each time, the free memory is computed correctly. The last part seems to require hooks into the heartbeat processing code on JT or TT. Alternatively, we can make the free memory computation package private. The latter seems to be very hacky. Any other ideas ? Attaching a more complete patch with test cases and after checking test-patch results.
The latest patch adds a JUnit test class that works as follows:
In order for this test to work, I had to make a few changes to some core classes. Please comment on whether these are reasonable. The changes are:
Review comments:
Am assuming that the accessor method in JobInProgress is only for future use inside a scheduler. +1 for addition of the api in MiniMRCluster. Code looks fine otherwise. Test case too. Just seen that
I had intended to consolidate all the resource estimation and tracking stuff in the ResourceEstimator class. Does that seem sensible?
It's possible that some of that stuff belongs, at least in part, in the scheduler. On the other hand, tracking current and estimated consumption is in some way orthogonal to all the other scheduling decisions, so I think it still makes sense to keep resource management in its own class[es]. Ari, I had looked at the ResourceEstimator class that you defined in
On the other hand, to communicate the used disk space, you are following roughly the same mechanism as what is defined here IIRC - which is to compute the free disk space in the heartbeat and communicate it via the TaskTrackerStatus. Vinod's comment was to unify your change as part of the resource map defined here so that it would be done in a like manner. Makes sense ? That'll teach me to comment before coffee.
Hemanth – Yes, I think it makes sense to pass the disk space via the new resourceMap. I won't have time to revise the disk space tracking for a week or two. Shall I open a JIRA, so we don't lose track? My only very slight qualm is that I'd rather avoid passing the names, as well as values, of the various resources we're tracking. As I understand, heartbeats are a bottleneck, so it's worth being frugal there. But probably it doesn't make sense to worry about that at this stage. We can do serialization magic later if it turns out to be worthwhile. Ari - no issues, I will modify it as part of this patch. Because your patch is already committed, I can change it now. Also, its been raised as a comment on this JIRA.
Regarding passing names, we did think of the load on heartbeats. However, I think elsewhere the number of heartbeats is being cut down. Hence, we decided to go with an approach that was more clear. In an offline discussion with Devaraj, he suggested that I not modify the job conf when localizing the task. Also, the patch no longer applies to trunk. I will modify this behavior and upload a new patch.
New patch addressing some of the review comments.
Sigh. Please ignore the last patch. The way I tried to make the resource map handle generic writables seems incorrect. Will upload a new one.
Incorporated review comments and synchronized with trunk.
Done. Synchronized it on the tasktracker object.
I was not able to see how to make it work with Writables. The problem is with reading back the objects. It seemed like we need to know what Writable the object actually is, in order to read it back fully. That would have prevented us from making it generic. There is an ObjectWritable, but that seems an overkill. What I've done instead is to create a small wrapper object that implements Writable called ResourceStatus. This is used to encapsulate all fields that need to be reported to the JT in the TaskTrackerStatus. Please comment if this makes sense.
Modified the availableSpace as a field in ResourceStatus. Ari, can you please check if this change looks OK ? I didn't see a test case in
Done this. Rather than localizing, I calculate the default free memory when required - for e.g in findFreeVirtualMemory. I am okay with the patch. The one thing i do want to point out is that the ResourceStatus could be made extensible and any component in the TT that wants to advertise a resource Key/Value info can do so (as opposed to hardcoding the memory/disk-space resources only). But this could be for later.
The other thing is the way we handle -Xmx in this setup. Assume a case where the user hasn't specified any memory requirement for his job. The memory that a task would get is proportional to the amount of memory in the TT/#slots. Let's say for this cluster instance, it is 1G. Now if his -Xmx, which is an absolute number, is above this, say 1.5G, would it work? Note that the task JVM might work even with 1G. It is just the user happened to specify it as 1.5G. Minor nit: ResourceStatus should definitely use WritableUtils.{read|write}VLong rather than DataOutput.{read|write}Long.
I'd suggest that:
This patch addresses all comments from Arun and Owen. Regarding the following comment from Devaraj:
As far as I could see, the JVM seems to be allocating the Xmx value in chunks. So, it might not fail immediately, and might not at all if the task doesn't require more than 1G. However, if it does come over 1G, it might get killed. There seem to be two ways to handle this:
The latter option looks a little hacky (having to parse mapred.child.java.opts etc) and also could behave differently in different conditions making it difficult to debug. It seems simpler to just document this. On this line, I've updated both the hadoop-default.xml documentation and the Forrest documentation in Map/Red tutorial that mentions these options. Let me know if this seems fine. Also, I've retained mapred.child.ulimit, since it could be used for parameters other than memory.
Currently mapred.child.ulimit cannot be used for anything else, it is hardcoded to use it as virtual mem limit (ulimit -v) only. Should we change this? Also, I think we should deprecate setting vlimits via mapred.child.ulimit, they both (intend) to do the same thing. Documenting seems fine to me for now. Regarding the 'ulimit -v', I think we should deprecate the API getUlimitMemoryCommand.
Regarding deprecation, here is a way to handle this case. If the new option is defined in this jira is enabled (by default it is disabled), then the ulimit for memory is not applied when the task is launched. If the new option is disabled (the tasktracker has a negative number for max mem), then we apply the ulimit settings (as it is done today)...
New patch that deprecates getUlimitMemoryCommand, and updates documentation for the same. Also, the behavior that Devaraj mentions is implemented with respect to when the ulimit is set. Put another way, if an administrator has specified a value for the tasktracker's memory limit, we give it higher priority than ulimit setting. This seems to be the right thing to do. In the next release, we can completely remove the ulimit configuration.
Also, I ran test-patch and get a -1 for the number of javac warnings as the deprecation warning will come up now. This is expected, right ? Comments from others on these changes ? Why does this need to be deprecated? Generalizing it as a way of setting up other limits or ENV config seems more correct. Might I not want to limit the size of any process as well as the size of the process tree?
This is more secure when it works, since it keeps me from even creating the large object. The other approach might miss the problem until too late... I agree with Eric that keeping the ulimit capability is a good thing. It is better to prevent the problem than react to the problem, given a choice...
+1 from me too. Since there is consensus on keeping the ulimit method, I've reverted the changes and uploaded a new patch. Will submit this to Hudson, as all comments have been taken care of now.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388580/HADOOP-3759.patch against trunk revision 688101. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3085/testReport/ This message is automatically generated. -1 on javadoc is due to
-1 on core tests is due to I can't figure out why there's a -1 on the contrib tests, no tests appear to have failed. And I can't see anything from the console log as well. I notice that many tests which ran on Hudson have also got a -1 on contrib. Therefore, I suspect it is unlikely to be caused by this patch. I just committed this. Thanks Hemanth!
Integrated in Hadoop-trunk #586 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/586/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HADOOP-3581proposes a maximum amount of virtual memory, say MAX_MEM, that all tasks (and their descendants) on that tasktracker would use.HADOOP-3581's fix) and reports that to the jobtracker. This is similar to the approach followed inHADOOP-657for disk space.HADOOP-3581, it will be killed.Comments ?