Hadoop Common
HADOOP-2919

Create fewer copies of buffer data during sort/spill

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, the sort/spill works as follows:

      Let r be the number of partitions
      For each call to collect(K,V) from map:

      • If buffers do not exist, allocate a new DataOutputBuffer to collect K,V bytes, allocate r buffers for collecting K,V offsets
      • Write K,V into buffer, noting offsets
      • Register offsets with the associated partition buffer, allocating/copying accounting buffers if necessary
      • Calculate the total mem usage for buffer and all partition collectors by iterating over the collectors
      • If total mem usage is greater than half of io.sort.mb, then start a new thread to spill, blocking if another spill is in progress

      For each spill (assuming no combiner):

      • Save references to our K,V byte buffer and accounting data, setting the former to null (will be recreated on the next call to collect(K,V))
      • Open a SequenceFile.Writer for this partition
      • Sort each partition separately (the current sort reuses its index objects, but still requires wrapping indices in IntWritable objects)
      • Build a RawKeyValueIterator of sorted data for the partition
      • Deserialize each key and value and call SequenceFile::append(K,V) on the writer for this partition

      There are a number of opportunities for reducing the number of copies, creations, and operations we perform in this stage, particularly since growing many of the buffers involved requires that we copy the existing data to the newly sized allocation.
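
      A rough sketch of the flow above, for orientation only; the class and field names (SimpleCollector, softLimitBytes) are illustrative and not the actual MapTask code. It shows the per-partition offset tracking and the half-of-io.sort.mb check that triggers a spill:

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical model of the pre-patch collect path described above.
        public class SimpleCollector {
          private final ByteArrayOutputStream kvBytes = new ByteArrayOutputStream();
          private final DataOutputStream kvOut = new DataOutputStream(kvBytes);
          private final List<List<int[]>> partitionOffsets;   // r offset buffers
          private final long softLimitBytes;                   // ~io.sort.mb / 2

          public SimpleCollector(int numPartitions, long ioSortBytes) {
            partitionOffsets = new ArrayList<List<int[]>>(numPartitions);
            for (int i = 0; i < numPartitions; i++) {
              partitionOffsets.add(new ArrayList<int[]>());
            }
            softLimitBytes = ioSortBytes / 2;
          }

          public void collect(byte[] key, byte[] value, int partition) throws IOException {
            int keyStart = kvBytes.size();
            kvOut.write(key);                                  // write K, noting offsets
            int valStart = kvBytes.size();
            kvOut.write(value);                                // write V
            partitionOffsets.get(partition)
                .add(new int[] { keyStart, valStart, kvBytes.size() });
            if (kvBytes.size() > softLimitBytes) {             // crude stand-in for the
              startSpill();                                    // per-collector usage scan
            }
          }

          private void startSpill() {
            // In the real code this hands the buffers to a spill thread, sorts each
            // partition, and appends the records to a SequenceFile; omitted here.
          }
        }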

      1. 2919-7.patch
        62 kB
        Chris Douglas
      2. 2919-6.patch
        62 kB
        Chris Douglas
      3. 2919-5.patch
        61 kB
        Chris Douglas
      4. 2919-4.patch
        60 kB
        Chris Douglas
      5. 2919-3.patch
        60 kB
        Chris Douglas
      6. 2919-2.patch
        60 kB
        Chris Douglas
      7. 2919-1.patch
        57 kB
        Chris Douglas
      8. 2919-0.patch
        53 kB
        Chris Douglas

          Activity

          Owen O'Malley added a comment -

          I just committed this. Thanks, Chris!

          Owen O'Malley added a comment -

          I was concerned that TestMiniMRDFSSort was failing on Hudson with this patch, even though it was working on "real" machines. It looks like it was primarily resource starvation on the zones machines. I filed HADOOP-3142 and HADOOP-3143 to address the problems.

          Owen O'Malley added a comment -

          +1

          Chris Douglas added a comment -

          sort on 100 nodes:

                              trunk (r641466)   trunk + patch
           uncompressed bytes      38:17             34:38
           compressed text         34:18             31:37
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378667/2919-7.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2074/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2074/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2074/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2074/console

          This message is automatically generated.

          Chris Douglas added a comment -

          This patch is identical to 2919-6, but the output buffer is released prior to the merge.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378286/2919-6.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2010/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2010/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2010/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2010/console

          This message is automatically generated.

          Chris Douglas added a comment -

          (whoops; caught unrelated changes)

          Chris Douglas added a comment -

          Integrated Owen's feedback

          Owen O'Malley added a comment -

          Some more comments:
          1. rewrite softlimit computations moving the ?: operators down to just calculate the number of entries & bytes
          2. include size of record and size of buffer in map output buffer too small exception
          3. rename inmemuncompressedbytes since it can contain compressed bytes too
          4. make combine and spill call close on combiner in finally block
          5. remove IllegalArgumentException in memuncompressedbytes declaration
           6. make static finals for the index offsets (0, 1, 2, and 3) (see the sketch below)
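
           A minimal sketch of items 4 and 6; the class, field, and method names here are illustrative, not the patch's actual identifiers:

             import java.io.Closeable;
             import java.io.IOException;

             public class SpillSketch {
               // item 6: named constants instead of bare 0, 1, 2, 3 index offsets
               private static final int PARTITION = 0;
               private static final int KEYSTART = 1;
               private static final int VALSTART = 2;
               private static final int VALEND = 3;

               interface Combiner extends Closeable {
                 void combine() throws IOException;
               }

               // item 4: close the combiner in a finally block so it is closed
               // even when the combine or the spill write throws
               void combineAndSpill(Combiner combiner) throws IOException {
                 try {
                   combiner.combine();
                   // ... write the combined output to the spill file ...
                 } finally {
                   combiner.close();
                 }
               }
             }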

          Chris Douglas added a comment -

          Addresses most of Owen's feedback, excluding the following:

          2. think about case when key precisely fills buffer and whether we need to reset in that case

           For now, storing key start/end indices is useful enough that I'm loath to make a corner case of this. Re-copying the key is unnecessary, but, I'm guessing, not very costly relative to adding an additional branch into the compare (since it should be a very infrequent case).

          4. add test case where key compare uses length (bytes writable?)

          I haven't been able to find a Writable with this property (HADOOP-3046), but I'll add a test case presently.

          Owen O'Malley added a comment -

          A few more suggestions:

          1. reduce from warn to debug for single value spill
          2. think about case when key precisely fills buffer and whether we need to reset in that case
          3. fix sizes to raw comparator
          4. add test case where key compare uses length (bytes writable?)
          5. rename mark to avoid mark/reset confusion
          6. use byte[] inside of reset instead of dataoutputbuffer
          7. remove extra semicolon in MapTask.getVBytesForOffset
          8. don't ignore interruptedexception
           9. wrap remotely caught exceptions in a new exception and set its cause to the remotely caught exception (see the sketch below)
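
           A minimal sketch of suggestion 9; the names sortSpillException and checkSpillException are illustrative. The spill thread records the exception it caught, and the task thread rethrows it wrapped in a new exception whose cause is the original, so the remote stack trace is preserved:

             import java.io.IOException;

             public class SpillErrorSketch {
               private volatile Throwable sortSpillException;

               // called from the spill thread's catch block
               void recordSpillFailure(Throwable t) {
                 sortSpillException = t;
               }

               // called from the task thread, e.g. at the top of collect()
               void checkSpillException() throws IOException {
                 Throwable t = sortSpillException;
                 if (t != null) {
                   IOException wrapped = new IOException("Spill failed");
                   wrapped.initCause(t);   // keep the remotely caught exception as the cause
                   throw wrapped;
                 }
               }
             }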

          Owen O'Malley added a comment -

          I just started going through this, but I've seen some nits:
          1. IndexedSorter is importing Progressable, but not using it.
           2. utils shouldn't depend on mapred, so IndexedSorter and IndexedSortable should move out to utils.
          3. QuickSort should use empty braces rather than just a semicolon, so replace:
          while (++i < r && s.compare(i, x) < 0);
          while (--j > x && s.compare(x, j) < 0);
          with
          while (++i < r && s.compare(i, x) < 0) { } // NOTHING
          while (--j > x && s.compare(x, j) < 0) { } // NOTHING

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377841/2919-4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1958/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1958/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1958/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1958/console

          This message is automatically generated.

          Chris Douglas added a comment -

          Merged patch with latest trunk

          Devaraj Das added a comment -

          Cancelling for resubmission

          Chris Douglas added a comment -

          I haven't been able to reproduce this failure in Linux or on MacOS. Looking at the console output, the timeout looks related to HADOOP-2971. I'm seeing a handful of the following errors from Hudson:

              [junit] 2008-03-10 23:22:51,803 INFO  dfs.DataNode (DataNode.java:run(1985)) - PacketResponder blk_1646669170773132170 1 Exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for /127.0.0.1:34190 (local: /127.0.0.1:34496) to be ready for read
              [junit]   at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:188)
              [junit]   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:135)
              [junit]   at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:121)
              [junit]   at java.io.DataInputStream.readFully(DataInputStream.java:176)
              [junit]   at java.io.DataInputStream.readLong(DataInputStream.java:380)
              [junit]   at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1957)
              [junit]   at java.lang.Thread.run(Thread.java:595)
          

          Since the failure is coming from TestMiniMRDFSSort (code this patch certainly affects), this result is not auspicious, but I suspect the issue is not related to this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377540/2919-3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1930/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1930/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1930/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1930/console

          This message is automatically generated.

          Chris Douglas added a comment -

          Fixed findbugs warnings, suppressed spurious serialization warning for private IOE subclass

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12377508/2919-2.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac -1. The applied patch generated 590 javac compiler warnings (more than the trunk's current 589 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 5 new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1927/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1927/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1927/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1927/console

          This message is automatically generated.

          Chris Douglas added a comment -

          This patch makes some minor performance improvements, adds documentation, and correctly effects record compression in-place.

          The following should probably be implemented as separate JIRAs:

          • QuickSort would benefit from the optimization whereby keys equal to the pivot are swapped into place at the end of a pass.
          • Instead of recreating the spill thread, a persistent thread should accept spill events. This will permit one to set the spill threshold to less than 50% and avoid the overhead of creating a thread (assumed to be slight relative to the cost of a spill, but worth eliminating).
          • Recreating collectors is expensive. Pooling resources- particularly the collection buffers- between jobs (once JVM reuse is in place) should make a significant difference for jobs with short-running maps.
          Chris Douglas added a comment -

          Updated patch (depends on HADOOP-2943). It should support large records now, but it remains a work in progress.

          Chris Douglas added a comment (edited) -

          sort benchmarks:

           nodes   trunk (r632182)   trunk + patch
           100          31:09            25:46
           20           23:10            20:03
          Chris Douglas added a comment -

          This patch effects the following changes to improve our efficiency in this area. Instead of gradually growing our buffers, we use properties to determine the size of the K,V byte buffer and accounting data and allocate it up front. We maintain accounting information for the task as two arrays of ints (rather than separate arrays for each partition), mimicking the existing BufferSorter interface. The first stores offsets into the second, which maintains the k/v offsets and partition information for the keys. This permits us to swap offsets to effect the sort, as is presently implemented in BufferSorter, but without requiring us to wrap them in IntWritables.

          kvoffset buffer        kvindices buffer
           _____________         _________________
          |offset k1,v1 |       | partition k1,v1 |
          |offset k1,v2 |       | k1 offset       |
               ...              | v1 offset       |
          |offset kn,vn |       | partition k2,v2 |
                                | k2 offset       |
                                | v2 offset       |
                                       ...
                                | partition kn,vn |
                                | kn offset       |
                                | vn offset       |
          

          By default, the total size of the accounting space is 5% of io.sort.mb. We build on the work done in HADOOP-1965, but rather than using 50% of io.sort.mb before a spill, we set a "soft" limit that defaults to 80% of the number of records or 80% of the K,V buffer before starting a spill thread. Note that this limit does not require us to query each partition collector for its memory usage, but can be effected by examining our indices. Rather than permitting the spill thread to "own" references to the buffers, we maintain a set of indices into the offset and k,v byte buffers defining the area of each in which the spill buffer is permitted to work. According to the Java VM spec, we can assume that reading/writing array elements does not require a lock on the array.
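
           A simplified model of that layout; the 5% accounting share and 80% soft limit are the defaults named above, while the field and constant names are illustrative rather than the patch's actual identifiers:

             public class AccountingSketch {
               static final int PARTITION = 0;
               static final int KEYSTART = 1;
               static final int VALSTART = 2;
               static final int ACCTSIZE = 3;                  // ints per record in kvindices

               final int[] kvoffsets;                          // indirection array, permuted by the sort
               final int[] kvindices;                          // ACCTSIZE ints per record
               final byte[] kvbuffer;                          // serialized K,V bytes
               final int softRecordLimit;                      // spill once this many records are buffered
               final int softBufferLimit;                      // spill once this many bytes are buffered

               AccountingSketch(int ioSortMb) {
                 int maxMemUsage = ioSortMb << 20;
                 int acctBytes = (int) (maxMemUsage * 0.05);   // 5% of io.sort.mb for accounting
                 int records = acctBytes / ((ACCTSIZE + 1) * 4); // 4 bytes per int, +1 for kvoffsets
                 kvoffsets = new int[records];
                 kvindices = new int[records * ACCTSIZE];
                 kvbuffer = new byte[maxMemUsage - acctBytes];
                 softRecordLimit = (int) (records * 0.8);      // 80% of the record capacity
                 softBufferLimit = (int) (kvbuffer.length * 0.8);
               }

               // sorting swaps kvoffsets entries only; the kvindices triples and the
               // serialized bytes never move, and no IntWritable wrappers are needed
               void swap(int i, int j) {
                 int tmp = kvoffsets[i];
                 kvoffsets[i] = kvoffsets[j];
                 kvoffsets[j] = tmp;
               }
             }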

          We maintain three indices for both the accounting and k,v buffers: start, end, and index. The area between start and end is available to the spill, while the area between end and index (in truth, a marker noting end of the last record written) contains "spillable" data yet to be written to disk. If the soft limit is reached- or if one attempts a write into the buffer that is too large to accommodate without a spill- then the task thread sets the end index to the last record marker and triggers a spill. While the spill is running, the area between the start and end indices is unavailable for writing from collect(K,V) and the task thread will block until the spill has completed if the index marker hits the start marker.

          Buffer indices during a spill:
           ___________      ___________      ___________
          |___________|    |___________|    |___________|
           ^     ^  ^ ^      ^  ^  ^  ^      ^  ^ ^   ^
           s     e  i v      i  s  e  v      e  i s   v
          

          It is worth mentioning that each key must be contiguous to be used with a RawComparator, but values can wrap around the end of the buffer. This requires us to note the "voided" space in the buffer that contains no data. When the spill completes, it sets the start marker to the end marker, making that space available for writing. Note that it must also reset the void marker to the buffer size if the spill wraps around the end of the buffer (the rightmost case in the preceding figure). The "voided" marker is owned by whichever thread needs to manipulate it, so we require no special locking for it.
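
           A rough illustration of how the task thread and spill thread coordinate through those markers; wrap-around and the "void" marker are omitted, and the names are illustrative rather than the patch's actual code:

             public class SpillMarkersSketch {
               private int kvstart;                 // start of the region owned by the spill
               private int kvend;                   // end of the spill region (== kvstart when idle)
               private int kvindex;                 // next record slot for collect()
               private boolean spillInProgress;

               synchronized void startSpill() {
                 kvend = kvindex;                   // freeze everything collected so far
                 spillInProgress = true;
                 notifyAll();                       // wake the spill thread
               }

               synchronized void spillDone() {
                 kvstart = kvend;                   // the spilled region becomes writable again
                 spillInProgress = false;
                 notifyAll();                       // wake a blocked collect()
               }

               synchronized void awaitSpaceFor(int nextIndex) throws InterruptedException {
                 while (spillInProgress && nextIndex == kvstart) {
                   wait();                          // collect() blocks until the spill frees space
                 }
               }
             }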

          When we sort, we sort all spill data by partition instead of creating a separate collector for each partition. Further, we can use appendRaw (as was suggested in HADOOP-1609) to write our serialized data directly from the k,v buffer to our spill file writer instead of deserializing each prior to the write. Note that for record-compressed data (when not using a combiner), this permits us to store compressed values in our k,v buffer.
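
           A hedged sketch of the appendRaw idea: expose a slice of the in-memory k,v buffer as a SequenceFile.ValueBytes so the writer copies serialized bytes straight to the spill file with no deserialization. The class name InMemValBytes and the surrounding wiring are illustrative; the ValueBytes interface and Writer.appendRaw(byte[], int, int, ValueBytes) are assumed from the SequenceFile API of this era.

             import java.io.DataOutputStream;
             import java.io.IOException;
             import org.apache.hadoop.io.SequenceFile;

             class InMemValBytes implements SequenceFile.ValueBytes {
               private byte[] buffer;
               private int start;
               private int length;

               void reset(byte[] buffer, int start, int length) {
                 this.buffer = buffer;
                 this.start = start;
                 this.length = length;
               }

               public int getSize() {
                 return length;
               }

               public void writeUncompressedBytes(DataOutputStream out) throws IOException {
                 out.write(buffer, start, length);
               }

               public void writeCompressedBytes(DataOutputStream out) throws IOException {
                 // for record compression the value bytes are already stored compressed
                 // in the k,v buffer, so they too can be copied verbatim
                 out.write(buffer, start, length);
               }
             }

             // usage, roughly: for each record in sorted order
             //   vbytes.reset(kvbuffer, valstart, vallen);
             //   writer.appendRaw(kvbuffer, keystart, keylen, vbytes);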

          The attached patch is a work in progress, and is known to suffer from the following deficiencies:

           • Very large keys and values (with a comparably small io.sort.mb) present a difficult problem for a statically allocated collection buffer. If a series of writes to an empty collection exceeds the space allocated to the k,v byte buffer (e.g. a 100MB k,v byte buffer and a Writable that attempts two 51MB write(byte[],int,int) calls), the current patch will loop forever. This will also happen for separate writes. The current patch only spills when the soft limit is reached.
          • Handling of compression is inelegantly implemented. Again, this is a work in progress and will be cleaned up.
          • The spill thread is created each time it is invoked, but it need not be.
          • The code managing the contiguous key property is not as efficient as it could be.
          • The implementation of QuickSort could be improved (re: Sedgewick) to handle the case where keys are equal to the pivot, probably a fairly common case.

            People

             • Assignee: Chris Douglas
             • Reporter: Chris Douglas
             • Votes: 0
             • Watchers: 0

              Dates

              • Created:
                Updated:
                Resolved:

                Development