|
Current flow relevant to this discussion:
TaskTracker.offerService() -> TaskTracker.checkForNewTasks() -> if (TaskTracker.enoughFreeSpace()) then poll/startNewTask We could put the above checks (infact we can do better by checking if we have assigned fileSplit's size * conf.getFloat("map.output.growth.factor", 1.0)) in TaskTracker.enoughFreeSpace()... ... alternatively we could make '(sum over running tasks of (1.0 - done) * allocation)' part of TaskTrackerStatus i.e. a 'availableDiskSpace' member, check to ensure that 'sufficient' free space is available on the tasktracker before assigning it the task itself in JobInProgress.findNewTask - this ensures that a task isn't allocated in the first place to a tasktracker if it can't handle it. What do you guys think? Am I missing out on something which prevents option #2 from working? option #2 with putting the "unallocated" disk space into the TaskTrackerStatus should work well.
We should also log something when tasks are not accepted due to space limitations. At present I think nothing is displayed in this case.
It should be very useful to include the 'unallocated disk space' in the global and per-host jsps so as to provide an easy way for operator to diagnose if and when tasks can't be allocated due to lack of diskspace... I think it should be a part of this same bug.
> include the 'unallocated disk space' in the global and per-host jsps
Yes, I agree this is another metric that should be displayed in the central web ui. It should be reported through the metrics API, as discussed in I see the value in Doug's suggestion... for e.g. at some point in the future we might also put in metrics like CPU load, VM stats etc. and this would let the JobTracker make 'smarter' decisions about which task to assign to which TaskTrackers i.e. CPU-bound tasks to IO-laden TTs and vice-versa.
I do agree that it might be a very futuristic scenario, but the point is to keep the infrastructure robust when we can... Actually, I'd rather have the unallocated disk space in the heartbeat, because when
> I'd rather have the unallocated disk space in the heartbeat [ ...]
I agree, but I think we should use a general mechanism to route metrics to the jobtracker through heartbeats, rather than hack things in one-by-one. Taking things forward, looks like both Owen/Doug agree that we need to send metrics through the heartbeat...
... given this we are looking at implementing a MapReduceMetricsContext which sends over heartbeat (with metrics) via RPC to the JT and TaskTracker.offerService() becoming a callback for the timer. Does that make sense or do folks prefer something different? It also necessiates a 'Writable' MetricsRecordImpl (for RPC) and some apis for 'reading' the metrics i.e. getMetric/getTag apis which the JobTracker can use to retrieve information.
Here's my proposed fix:
1) Add a "free space on compute node" field to TaskTrackerStatus. This is the real physical space available, plus the sum of (commitment - reservation) for each running map task. 2) Add a "space used by this task" and "space reserved for task" to TaskStatus as well. 3) Add a "space to reserve" to either Task or MapTask. This is computed by the JobTracker, and used by the TaskTracker 4) Create a new ResourceConsumptionEstimator class, and have an instance of that type for each JobInProgress. This will have, at a minimum, reportCompletedMapTask(MapTaskStatus t) and estimateSpaceForMapTask(MapTask mt) The implementation would probably be a thread that processes asynchronously, and updates an atomic value that'll be either the estimated space requirement, or else the estimated ratio between input size and output size. Until sufficiently many maps have completed (10%, say) the size estimate would just be the size of each map's input. Afterwards, we'll take the 75th percentile of the measured blowup in task size. 5) Modify obtainNewMapTask to return null if the space available on the given task tracker is less than the estimate of available space. 6) To avoid deadlocks if there are multiple jobs contending for space, abort the job if too many trackers are rejected as having insufficient space. Thoughts? Here's a stab at solving the issue.
I've tested this locally, and it doesn't break anything, when space IS available. I haven't yet tested this in the low-disk space case.
Ari, I haven't looked at the patch yet, but it'd help if could you please give an example for this one with some numbers. Here's what we have currently. ResourceEstimator keeps an estimate of how big the average map's output is. As Map tasks complete, we update this. If a node has less than twice the average outputsize in free disk space, we don't assign tasks to it. Haven't implemented the percentile aspect; average is computationally much easier.
So if a job has 10 GB of input, split across ten map tasks, tasks will only be started on nodes with at least two gigabytes free. It's been tested locally, and indeed, jobs only go to a task tracker with sufficient space. Next step is testing at scale, on a cluster, Current patch.
Limitations: temporarily canceling patch; expect final version within a few days.
Tested on a small cluster, seems to work.
One limitation is that we don't automatically resolve deadlocks. This could be done, e.g., by failing tasks that can't be placed for a long period. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383592/diskspaceest_v3.patch against trunk revision 664208. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 3 new Findbugs warnings. -1 release audit. The applied patch generated 196 release audit warnings (more than the trunk's current 195 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2613/testReport/ This message is automatically generated. Fix pedantic fixbugs warnings; add license at top of file.
Notes: TestHarFileSystem error is believed unrelated to patch. revised, should no longer trigger findbugs
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384165/diskspaceest_v4.patch against trunk revision 668867. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2677/testReport/ This message is automatically generated. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385993/clean_spaceest.patch against trunk revision 676772. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2861/testReport/ This message is automatically generated.
I don't have strong feelings about whether to do space-consumed measurement in the TaskTracker or the Task. I figured it made more sense to fill out the whole TaskStatus in one place.Otherwise you get confused in the TaskTracker code, whether or not the space-consumed has been filled in yet. I'm open to doing this the other way 'round, and having TaskTracker responsible for it. Certainly if there were other similar resource counters being filled in in TaskTracker, this one ought to be.
I was tempted to use metrics for this, and looked at piggybacking of this sort of thing more generally on heartbeats. I was promptly shot down. There was a strong sentiment, notably from Owen and Arun, that Hadoop's core functionality shouldn't depend on Metrics, and that Metrics should just be for analytics. As per Vinod's suggestion, move lookup into TaskTracker.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386347/spaceest_717.patch against trunk revision 677781. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2896/testReport/ This message is automatically generated. I just committed this. Thanks, Ari!
The reduce task for sortvalidator , 500 nodes, seem to get stuck with the following message (several of them), even though the sort itself succeeded fine. Could there be a bug in the estimation of the reduce input size?
2008-08-14 07:56:16,507 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_xxx.com:xxx..com/<IPADDR>:58251 has 204889718784 bytes free; but we expect reduce input to take 1004644589190 Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
What do you guys think is a reasonable default for "map.output.growth.factor" ? 1.0? Do we need more leeway?