sort benchmarks:
Updated patch (depends on
This patch makes some minor performance improvements, adds documentation, and correctly effects record compression in-place.
The following should probably be implemented as separate JIRAs:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377508/2919-2.patch against trunk revision 619744.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 3 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac -1. The applied patch generated 590 javac compiler warnings (more than the trunk's current 589 warnings).
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs -1. The patch appears to introduce 5 new Findbugs warnings.
core tests -1. The patch failed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1927/testReport/
This message is automatically generated.

Fixed findbugs warnings, suppressed spurious serialization warning for private IOE subclass
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377540/2919-3.patch against trunk revision 619744.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 3 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests -1. The patch failed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1930/testReport/
This message is automatically generated.

I haven't been able to reproduce this failure in Linux or on MacOS. Looking at the console output, the timeout looks related to
[junit] 2008-03-10 23:22:51,803 INFO dfs.DataNode (DataNode.java:run(1985)) - PacketResponder blk_1646669170773132170 1 Exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for /127.0.0.1:34190 (local: /127.0.0.1:34496) to be ready for read
[junit] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:188)
[junit] at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:135)
[junit] at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:121)
[junit] at java.io.DataInputStream.readFully(DataInputStream.java:176)
[junit] at java.io.DataInputStream.readLong(DataInputStream.java:380)
[junit] at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1957)
[junit] at java.lang.Thread.run(Thread.java:595)
Since the failure is coming from TestMiniMRDFSSort, code this patch certainly affects, this result is not auspicious, but I suspect the issue is not related to this patch.

Merged patch with latest trunk
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377841/2919-4.patch against trunk revision 619744.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 3 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests -1. The patch failed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1958/testReport/
This message is automatically generated.

I just started going through this, but I've seen some nits:
1. IndexedSorter is importing Progressable, but not using it.
2. utils should not depend on mapred, so IndexedSorter and IndexedSortable should move out to utils.
3. QuickSort should use empty braces rather than just a semicolon, so replace:
     while (++i < r && s.compare(i, x) < 0);
     while (--j > x && s.compare(x, j) < 0);
   with
     while (++i < r && s.compare(i, x) < 0) { } // NOTHING
     while (--j > x && s.compare(x, j) < 0) { } // NOTHING

A few more suggestions:
1. reduce from warn to debug for single value spill

Addresses most of Owen's feedback, excluding the following:
For now, storing key start/end indices is useful enough that I'm loath to make a corner case of that. Re-copying the key is unnecessary but, I'm guessing, not very costly relative to adding an additional branch into the compare (since it should be a very infrequent case).
I haven't been able to find a Writable with this property (

Some more comments:
1. rewrite softlimit computations moving the ?: operators down to just calculate the number of entries & bytes
2. include size of record and size of buffer in map output buffer too small exception
3. rename inmemuncompressedbytes since it can contain compressed bytes too
4. make combine and spill call close on combiner in finally block
5. remove IllegalArgumentException in memuncompressedbytes declaration
6. make static finals for the index offsets (0, 1, 2, and 3)

(whoops; caught unrelated changes)
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378286/2919-6.patch against trunk revision 619744.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 3 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests -1. The patch failed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2010/testReport/
This message is automatically generated.

This patch is identical to 2919-6, but the output buffer is released prior to the merge.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378667/2919-7.patch against trunk revision 619744.
@author +1. The patch does not contain any @author tags.
tests included +1. The patch appears to include 3 new or modified tests.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new javac compiler warnings.
release audit +1. The applied patch does not generate any new release audit warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests -1. The patch failed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2074/testReport/
This message is automatically generated.

sort on 100 nodes:
I was concerned that TestMiniMRDFSSort was failing on Hudson with this patch, even though it was working on "real" machines. It looks like it was primarily resource starvation on the zones machines. I filed
I just committed this. Thanks, Chris!
kvoffset buffer       kvindices buffer
 _____________         _________________
|offset k1,v1 |       | partition k1,v1 |
|offset k2,v2 |       | k1 offset       |
      ...             | v1 offset       |
|offset kn,vn |       | partition k2,v2 |
                      | k2 offset       |
                      | v2 offset       |
                             ...
                      | partition kn,vn |
                      | kn offset       |
                      | vn offset       |

By default, the total size of the accounting space is 5% of io.sort.mb. We build on the work done in
HADOOP-1965, but rather than using 50% of io.sort.mb before a spill, we set a "soft" limit that defaults to 80% of the number of records or 80% of the K,V buffer before starting a spill thread. Note that this limit does not require us to query each partition collector for its memory usage, but can be effected by examining our indices. Rather than permitting the spill thread to "own" references to the buffers, we maintain a set of indices into the offset and k,v byte buffers defining the area of each in which the spill buffer is permitted to work. According to the Java VM spec, we can assume that reading/writing array elements does not require a lock on the array.

We maintain three indices for both the accounting and k,v buffers: start, end, and index. The area between start and end is available to the spill, while the area between end and index (in truth, a marker noting the end of the last record written) contains "spillable" data yet to be written to disk. If the soft limit is reached, or if one attempts a write into the buffer that is too large to accommodate without a spill, then the task thread sets the end index to the last record marker and triggers a spill. While the spill is running, the area between the start and end indices is unavailable for writing from collect(K,V), and the task thread will block until the spill has completed if the index marker hits the start marker.
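
The start/end/index bookkeeping above can be illustrated with a small single-threaded model of the accounting buffer alone. This is only a sketch under stated assumptions, not the code in the patch: the class name SpillIndexSketch and its methods are hypothetical, the spill is simulated by an explicit spillDone() call rather than a thread, and collect() returns false at the point where the real task thread would block.

```java
// Hypothetical model of the start/end/index markers; not taken from the patch.
public class SpillIndexSketch {
    private final int capacity;   // record slots in the accounting buffer
    private final int softLimit;  // pending records that trigger a spill
    private int start = 0;        // first record owned by the running spill
    private int end = 0;          // one past the last record owned by the spill
    private int index = 0;        // next slot the task thread will write
    private boolean spillRunning = false;

    public SpillIndexSketch(int capacity, double spillPercent) {
        this.capacity = capacity;
        this.softLimit = (int) (capacity * spillPercent);
    }

    /** Records collected since the last spill was handed off (end..index). */
    public int pending() {
        return (index - end + capacity) % capacity;
    }

    public boolean isSpillRunning() {
        return spillRunning;
    }

    /** Claims one slot; returns false where the task thread would block. */
    public boolean collect() {
        int next = (index + 1) % capacity;
        if (spillRunning && next == start) {
            return false; // index would hit start: wait for the spill
        }
        index = next;
        if (!spillRunning && pending() >= softLimit) {
            end = index;         // hand the region [start, end) to the spill
            spillRunning = true; // a real implementation would wake a thread
        }
        return true;
    }

    /** Called when the simulated spill finishes: its region becomes writable. */
    public void spillDone() {
        start = end;
        spillRunning = false;
    }
}
```

With a capacity of 10 and a soft limit of 80%, the eighth collect() starts a spill, the ninth still succeeds, and the tenth would block until spillDone() releases the spilled region.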
It is worth mentioning that each key must be contiguous to be used with a RawComparator, but values can wrap around the end of the buffer. This requires us to note the "voided" space in the buffer that contains no data. When the spill completes, it sets the start marker to the end marker, making that space available for writing. Note that it must also reset the void marker to the buffer size if the spill wraps around the end of the buffer (the rightmost case in the preceding figure). The "voided" marker is owned by whichever thread needs to manipulate it, so we require no special locking for it.
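
As a rough illustration of the contiguity rule, the sketch below writes keys and values into a circular byte buffer: a key that would straddle the end of the buffer is moved to offset 0 and the skipped tail is recorded as void, while a value is simply allowed to wrap. The names (VoidMarkerSketch, bufvoid, writeKey, writeValue) are hypothetical, and the sketch makes simplifying assumptions: it does no free-space accounting and no spilling, so it will overwrite older data, and it does not claim to match how the patch itself keeps keys contiguous.

```java
// Hypothetical sketch of contiguous keys and wrapping values; not the patch code.
public class VoidMarkerSketch {
    public final byte[] buf;
    public int index = 0; // next write position
    public int bufvoid;   // bytes at or beyond this offset hold no data

    public VoidMarkerSketch(int size) {
        buf = new byte[size];
        bufvoid = size; // nothing is voided initially
    }

    /** Writes a key contiguously and returns its start offset. */
    public int writeKey(byte[] key) {
        if (key.length > buf.length) {
            throw new IllegalArgumentException("key larger than buffer");
        }
        if (index + key.length > buf.length) {
            bufvoid = index; // the tail from here on contains no data
            index = 0;       // restart the key at the front of the buffer
        }
        int keyStart = index;
        System.arraycopy(key, 0, buf, keyStart, key.length);
        index = (index + key.length) % buf.length;
        return keyStart;
    }

    /** Writes a value, letting it wrap around the end; returns its start. */
    public int writeValue(byte[] value) {
        int valStart = index;
        for (byte b : value) {
            buf[index] = b;
            index = (index + 1) % buf.length;
        }
        return valStart;
    }
}
```

A comparator that works on (buffer, offset, length) triples can then be handed any key directly, whereas a reader of values has to be prepared to continue from offset 0 when it reaches the end of the buffer.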
When we sort, we sort all spill data by partition instead of creating a separate collector for each partition. Further, we can use appendRaw (as was suggested in
HADOOP-1609) to write our serialized data directly from the k,v buffer to our spill file writer instead of deserializing each prior to the write. Note that for record-compressed data (when not using a combiner), this permits us to store compressed values in our k,v buffer.

The attached patch is a work in progress, and is known to suffer from the following deficiencies: