Hadoop Common
  1. Hadoop Common
  2. HADOOP-5494

IFile.Reader should have a nextRawKey/nextRawValue

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Merger.Segment has only the next() method defined which internally calls next(key,value) on the underlying IFile stream. This would read both the key and the value bytes. It would be good to have Merger.Segment.nextRawKey(), that would read only the key and delay reading the value until needed (in Merger.MergeQueue.next()) via a new method Merger.Segment.nextRawValue().
      This would mean that we load only one value bytes at a time, and hence would incur potentially much less (depending on how big the values are) on the memory footprint.

      1. 5494-1.patch
        13 kB
        Devaraj Das
      2. 5494-2.patch
        14 kB
        Devaraj Das
      3. 5494-3.patch
        15 kB
        Devaraj Das
      4. 5494-4.patch
        15 kB
        Devaraj Das

        Issue Links

          Activity

          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Hadoop-trunk #811 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/811/ )
          Hide
          Chris Douglas added a comment -

          I committed this. Thanks, Devaraj

          Show
          Chris Douglas added a comment - I committed this. Thanks, Devaraj
          Hide
          Chris Douglas added a comment -

          +1

          Show
          Chris Douglas added a comment - +1
          Hide
          Devaraj Das added a comment -

          ant tests/test-patch passed on my box.

          Show
          Devaraj Das added a comment - ant tests/test-patch passed on my box.
          Hide
          Devaraj Das added a comment -

          Fixes the nits raised by Chris.

          Show
          Devaraj Das added a comment - Fixes the nits raised by Chris.
          Hide
          Chris Douglas added a comment -

          Just a couple small nits:

          • The comment in IFile::nextRawKey should read "Position for the value" instead of "Position for next record"
          • The field dataIn in InMemoryReader hides the dataIn field in its superclass. While it looks like its use in the current patch is OK, this could cause some confusing behavior for future modifications.
          Show
          Chris Douglas added a comment - Just a couple small nits: The comment in IFile::nextRawKey should read "Position for the value" instead of "Position for next record" The field dataIn in InMemoryReader hides the dataIn field in its superclass. While it looks like its use in the current patch is OK, this could cause some confusing behavior for future modifications.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12404671/5494-3.patch
          against trunk revision 761838.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12404671/5494-3.patch against trunk revision 761838. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/115/console This message is automatically generated.
          Hide
          Devaraj Das added a comment -

          Passing the last through hudson.

          Show
          Devaraj Das added a comment - Passing the last through hudson.
          Hide
          Devaraj Das added a comment -

          Attaching an updated patch after some large scale tests. The earlier patch had a bug to do with reusing the "value" DataInputBuffer across in-memory and on-disk segments. Thanks to Chris for helping me trace this bug. No negative performance impact was observed in my runs.

          Show
          Devaraj Das added a comment - Attaching an updated patch after some large scale tests. The earlier patch had a bug to do with reusing the "value" DataInputBuffer across in-memory and on-disk segments. Thanks to Chris for helping me trace this bug. No negative performance impact was observed in my runs.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12403785/5494-2.patch
          against trunk revision 759398.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12403785/5494-2.patch against trunk revision 759398. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/73/console This message is automatically generated.
          Hide
          Devaraj Das added a comment -

          Pushing it through hudson, while I do the perf bench.

          Show
          Devaraj Das added a comment - Pushing it through hudson, while I do the perf bench.
          Hide
          Devaraj Das added a comment -

          Thanks for the review, Chris! Attached is the updated patch.

          Show
          Devaraj Das added a comment - Thanks for the review, Chris! Attached is the updated patch.
          Hide
          Chris Douglas added a comment -
          • In Merger.MergeQueue::next, there's a null check for the value. Is there any reason why value can't be final and initialized in the cstr, since the assignment is always to itself?
          • Merger.Segment::getValue can be removed, as can its value member field
          • Initializing valBytes to 0 and growing it only enough to fit currentValueLength seems too conservative
          • Won't this have the same effect as the current code? Even though the read is deferred, each IFile.Reader will still grow and keep large buffers. It won't grow to some multiple of the large record, but it will still have a memory problem because each reader is maintaining its own, growing buffer. By not growing exponentially, the current patch will have many more allocations, too.

          If the IFile.Reader from disk isn't doing any of its own buffering, then it shouldn't need to keep and grow its own DIB for the value bytes. It should accept the DIB from the Merger.MergeQueue and grow that, instead, since the byte[] ref wrapped by the DIB is invalid after being passed to IFile.Reader::nextRawValue, anyway.

          Show
          Chris Douglas added a comment - In Merger.MergeQueue::next , there's a null check for the value. Is there any reason why value can't be final and initialized in the cstr, since the assignment is always to itself? Merger.Segment::getValue can be removed, as can its value member field Initializing valBytes to 0 and growing it only enough to fit currentValueLength seems too conservative Won't this have the same effect as the current code? Even though the read is deferred, each IFile.Reader will still grow and keep large buffers. It won't grow to some multiple of the large record, but it will still have a memory problem because each reader is maintaining its own, growing buffer. By not growing exponentially, the current patch will have many more allocations, too. If the IFile.Reader from disk isn't doing any of its own buffering, then it shouldn't need to keep and grow its own DIB for the value bytes. It should accept the DIB from the Merger.MergeQueue and grow that, instead, since the byte[] ref wrapped by the DIB is invalid after being passed to IFile.Reader::nextRawValue , anyway.
          Hide
          Devaraj Das added a comment -

          This patch (an early one that still needs large scale testing) does the following:
          1) Removes the method next(DataInputBuffer, DataInputBuffer) from the Merger.Segment class and the IFile.Reader classes
          2) nextRawKey and nextRawValue are defined in those classes
          3) nextRawValue is called in Merger.MergeQueue.next() and the DataInputBuffer passed is allocated memory then (true for IFile.Reader class's implementation; the other case IFile.InMemoryReader is the case where the value is in memory already)
          4) Removes the IFile buffering that it does in addition to FileSystem's buffering. The FileSystem level buffering should be sufficient.

          The other thing that can be done is to have one next method that takes a key DataInputBuffer and returns two things - the filled up key DataInputBuffer and a stream for the value (a DataInput) without actually allocating memory upfront for the value (again, true for only the IFile.Reader case). But that can probably be a follow up jira.

          Show
          Devaraj Das added a comment - This patch (an early one that still needs large scale testing) does the following: 1) Removes the method next(DataInputBuffer, DataInputBuffer) from the Merger.Segment class and the IFile.Reader classes 2) nextRawKey and nextRawValue are defined in those classes 3) nextRawValue is called in Merger.MergeQueue.next() and the DataInputBuffer passed is allocated memory then (true for IFile.Reader class's implementation; the other case IFile.InMemoryReader is the case where the value is in memory already) 4) Removes the IFile buffering that it does in addition to FileSystem's buffering. The FileSystem level buffering should be sufficient. The other thing that can be done is to have one next method that takes a key DataInputBuffer and returns two things - the filled up key DataInputBuffer and a stream for the value (a DataInput) without actually allocating memory upfront for the value (again, true for only the IFile.Reader case). But that can probably be a follow up jira.

            People

            • Assignee:
              Devaraj Das
              Reporter:
              Devaraj Das
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development