|
[
Permlink
| « Hide
]
Owen O'Malley added a comment - 05/Feb/08 10:40 PM
Note that we also need Map/Reduce to use the new method so that it only does one call per a file to get block sizes. This would require that FileSplit have a new constructor that takes an array of locations rather than computing it on demand. The locations do NOT need to be serialized in the read/write fields methods. FileInputFormat should use a single call to getFileLocations rather than the current getSize, getBlockSize, and getFileCacheHints (down in FileSplit).
Thanks Owen. Attached patch includes
1. new API getFileBlockLocations which invokes getBlockLocations to return BlockLocation[] 2. Changes FileSplit to store host information and return when getLocations() is invoked 3. Change FileInputFormat to one call of getFileBlockLocations and store host information in FileSplit using new constructor I ran the unit test and do not see failures. Will test benchmark and report the timings. I ran sort (twice) on 100 nodes on trunk+this patch. It took 28.4 and 27.3 minutes. Mukund mentioned it took 29.04 min on trunk.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12375144/HADOOP-2027-1.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 21 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 619 javac compiler warnings (more than the trunk's current 608 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 3 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1781/testReport/ This message is automatically generated. You should deprecate the old FileSplit constructor and make it call the new one.
Set hosts to null in readFields. You should:
1. Not use strings of '*' around your javadoc. 2. Fill in the javadoc of public methods in BlockLocation. 3. I'd prefer using String[] in BlockLocation, since the API uses String rather than Text. 4. FileSystem.getFileBlockLocations should just pass the desired values into the constructor rather than setting them all, same for DFSClient 5. The indentation in FileInputFormat should bring lines to the open of the paren 6. Fix the calls to the now deprecated methods. Thanks! I'm looking forward to this patch. Incorporating changes suggested by Owen, removed javac warnings which were due to deprecated calls. I could not get rid of 2 of them which were deprecated calls to Kosmos FileSystem. I am submitting this patch for QA run. Will either try to get those fixed or open new JIRA to fix it
Resubmitting against latest trunk
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12375464/HADOOP-2027-2.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 30 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 608 javac compiler warnings (more than the trunk's current 604 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 3 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1788/testReport/ This message is automatically generated. Another try fixing findbugs
I deprecated KFS getFileCacheHints and modified KFSEmulationImpl to call local Filesytems getFileBlockLocations
There are few javac warnings, which are due to other deprecated APIs like listPaths globPaths. I am attaching this patch against trunk Canceling and resubmitting patch
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12375522/HADOOP-2027-5.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 33 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to cause Findbugs to fail. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1790/testReport/ This message is automatically generated. I missed BlockLocations.java (new file) so build failed. I will resubmit it
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12375528/HADOOP-2027-6.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 33 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 605 javac compiler warnings (more than the trunk's current 603 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1791/testReport/ This message is automatically generated. The javac warnings were expected due to other deprecated APIs.
Found the 2 additional warnings, they were from PhasedFileSystem.java
> [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/PhasedFileSystem.java:300: warning: [deprecation] getFileCacheHints(org.apache.hadoop.fs.Path,long,long) in org.apache.hadoop.fs.FilterFileSystem has been deprecated> [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/PhasedFileSystem.java:300: warning: [deprecation] getFileCacheHints(org.apache.hadoop.fs.Path,long,long) in org.apache.hadoop.fs.FileSystem has been deprecated Attaching same patch by regenerating against trunk.
sorry had missed BlockLocation again
Uploading new one with comments from dhruba.
Dhurba suggested it would be good to have information about host:port which is already provided by namenode call.
So i have one more API getNames() which is similar to DatanodeID's getName, this returns hostname:port and getHosts() returns hostnames as earlier. He suggest this is useful when we consider running 2 datanodes on same node. Attached is the patch which address this. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376764/HADOOP-2027-9.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 33 new or modified tests. patch -1. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1866/console This message is automatically generated. Regenerating against trunk. Tested this patch, applies clean on trunk.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376793/HADOOP-2027-10.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 33 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 617 javac compiler warnings (more than the trunk's current 614 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1874/testReport/ This message is automatically generated. TestRackAwareTaskPlacement was failing because of my latest changes. While adding getNames() method, I tried to derive getHosts() from string returned by getName(), this had ipaddress:port. But we needed hostnames. So, I created 2 separate arrays within BlockLocation. Now both getHosts() and getNames() return the expected output. Attaching a new patch.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377229/HADOOP-2559-11.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 36 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 593 javac compiler warnings (more than the trunk's current 590 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1902/testReport/ This message is automatically generated. modified TestTextInputFormat. With the updated patch, if file is of 0 length it will not be added to the splits.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377232/HADOOP-2559-12.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 42 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 592 javac compiler warnings (more than the trunk's current 590 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1903/testReport/ This message is automatically generated. All tests passed. 2 additional javac warnings are due to PhasedFileSystem.java as expected.
For now, creating empty hosts array when of input file is zero length. Owen opened HADOOP-2952 to address zero length files. Attaching another patch.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377271/HADOOP-2559-13.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 39 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 592 javac compiler warnings (more than the trunk's current 590 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1905/testReport/ This message is automatically generated. All tests passed. 2 additional javac warnings are due to PhasedFileSystem.java as expected.
Attaching another patch after changing getFileCacheHints in FileSystem.java and KFS.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377568/HADOOP-2027-14.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 39 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac -1. The applied patch generated 599 javac compiler warnings (more than the trunk's current 598 warnings). release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1938/testReport/ This message is automatically generated. javac warning from PhasedFileSystem. All other tests passed.
I just committed this. Thanks, Lohit!
Integrated in Hadoop-trunk #434 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/434/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||