Hadoop Common
  1. Hadoop Common
  2. HADOOP-5266

Values Iterator should support "mark" and "reset"

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Some users have expressed interest in having a mark-reset functionality on values iterator. Users can call mark() at any point during the iteration process and a subsequent reset() should move the iterator to the last value emitted when mark() was called.

      1. hadoop-5266-v4.patch
        56 kB
        Jothi Padmanabhan
      2. hadoop-5266-v3.patch
        56 kB
        Jothi Padmanabhan
      3. hadoop-5266-v2.patch
        54 kB
        Jothi Padmanabhan
      4. hadoop-5266-v1.patch
        54 kB
        Jothi Padmanabhan

        Issue Links

          Activity

          Hide
          Jothi Padmanabhan added a comment -

          Users would call values.mark() and values.reset() to use this functionality. However, the current Reduce API takes Iterator values as the parameter.

          public void reduce(K key, Iterator<V> values,
                                    OutputCollector<K, V> output, 
                                    Reporter reporter)
          

          Since the generic Iterator does not include functions mark and reset, how should this be handled? For example

          1. Change the API to take a ResettableIterator that extends Iterator and supports these two extra methods. However, this will break existing applications and so not an option
          2. Let the API take Iterator, but let the framework return a Resetabble Iterator and user code does a cast to ResettableIterator
          3. Some other way?

          Thoughts?

          Show
          Jothi Padmanabhan added a comment - Users would call values.mark() and values.reset() to use this functionality. However, the current Reduce API takes Iterator values as the parameter. public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter) Since the generic Iterator does not include functions mark and reset, how should this be handled? For example Change the API to take a ResettableIterator that extends Iterator and supports these two extra methods. However, this will break existing applications and so not an option Let the API take Iterator, but let the framework return a Resetabble Iterator and user code does a cast to ResettableIterator Some other way? Thoughts?
          Hide
          Jothi Padmanabhan added a comment -

          Task#ValuesIterator is a place where this functionality would fit in nicely as this is where the value bytes are deserialized and the value object created. However, ValuesIterator is shared by both ReduceValuesIterator and CombineValuesIterator. Would the 'mark/reset' functionality be still relevant for CombineValuesIterator so that we can add this in ValuesIterator or should this be just restricted to ReduceValuesIterator? Thoughts?

          Show
          Jothi Padmanabhan added a comment - Task#ValuesIterator is a place where this functionality would fit in nicely as this is where the value bytes are deserialized and the value object created. However, ValuesIterator is shared by both ReduceValuesIterator and CombineValuesIterator. Would the 'mark/reset' functionality be still relevant for CombineValuesIterator so that we can add this in ValuesIterator or should this be just restricted to ReduceValuesIterator? Thoughts?
          Hide
          Doug Cutting added a comment -

          > Would the 'mark/reset' functionality be still relevant for CombineValuesIterator so that we can add this in ValuesIterator or should this be just restricted to ReduceValuesIterator?

          My instinct is to add a new interface like MarkableIterator or somesuch, that only ReduceValuesIterator implements for now. If we later find a reason to use it in CombineValuesIterator then we can implement it there then. I use "interface" loosely. Since this appears in end-user APIs, it would better permit compatible evolution if we use an abstract class. It's not absolutely critical in this case, since we don't expect many users to implement this interface, but, e.g., a FilterReducer might reasonably implement it, and an abstract class would thus permit us to change this in ways that an interface would not without breaking user code.

          Show
          Doug Cutting added a comment - > Would the 'mark/reset' functionality be still relevant for CombineValuesIterator so that we can add this in ValuesIterator or should this be just restricted to ReduceValuesIterator? My instinct is to add a new interface like MarkableIterator or somesuch, that only ReduceValuesIterator implements for now. If we later find a reason to use it in CombineValuesIterator then we can implement it there then. I use "interface" loosely. Since this appears in end-user APIs, it would better permit compatible evolution if we use an abstract class. It's not absolutely critical in this case, since we don't expect many users to implement this interface, but, e.g., a FilterReducer might reasonably implement it, and an abstract class would thus permit us to change this in ways that an interface would not without breaking user code.
          Hide
          Owen O'Malley added a comment -

          This shouldn't be done for the old api. It should only be done on the mapreduce api.

          Show
          Owen O'Malley added a comment - This shouldn't be done for the old api. It should only be done on the mapreduce api.
          Hide
          Jothi Padmanabhan added a comment -

          Attaching a first version for review

          Show
          Jothi Padmanabhan added a comment - Attaching a first version for review
          Hide
          Jothi Padmanabhan added a comment -

          The attached patch supports for both the old and new API. If this needs to be done only for the new API, I could remove the relevant changes for the old api in the next iteration.

          Show
          Jothi Padmanabhan added a comment - The attached patch supports for both the old and new API. If this needs to be done only for the new API, I could remove the relevant changes for the old api in the next iteration.
          Hide
          Doug Cutting added a comment -

          Some nits from a cursory glance at the code:

          • MarkableIterable lacks javadoc, which as a public interface, is required.
          • Does BackupStore need to be public? I can't see why it does.
          • Does ReduceContext#backupStore need to be protected, not private?

          Should we benchmark this to make sure that it doesn't measurably slow things even when disabled?

          Show
          Doug Cutting added a comment - Some nits from a cursory glance at the code: MarkableIterable lacks javadoc, which as a public interface, is required. Does BackupStore need to be public? I can't see why it does. Does ReduceContext#backupStore need to be protected, not private? Should we benchmark this to make sure that it doesn't measurably slow things even when disabled?
          Hide
          Jothi Padmanabhan added a comment -

          Thanks for those initial comments. I will add javadoc for MarkableIterable and make ReduceContext#backupStore private.
          The reason why BackupStore is public is because I need to access this from the mapreduce package as well.

          Show
          Jothi Padmanabhan added a comment - Thanks for those initial comments. I will add javadoc for MarkableIterable and make ReduceContext#backupStore private. The reason why BackupStore is public is because I need to access this from the mapreduce package as well.
          Hide
          Devaraj Das added a comment -

          Some points:
          1. Put a comment around IFile.Writer.close() for the keyClass!=null check
          add the clear in the MarkableIterator interface
          2. A Counter for the number of times values are iterated over would be nice to have
          You probably can improve the implementation of how you write the firstkeybytes/firstvaluebytes by passing the Serializer the stream corresponding to the BackupStore as opposed to making a DataOutputBuffer copy of the bytes. Granted this is happening only for the first key/value bytes after a mark is called. But maybe it makes sense to keep the implementation tight if it doesn't mess up the code a lot.
          3. Remove values.clear() from the ReduceValuesIterator iteration
          4. Task.ValuesIterator.readNextValue should do "nextValueBytes.getLength() - nextValueBytes.getPosition()" to get the length?
          5. The size for the MemoryCache in BackupStore should probably be a fraction of mapred.job.reduce.input.buffer.percent.

          Show
          Devaraj Das added a comment - Some points: 1. Put a comment around IFile.Writer.close() for the keyClass!=null check add the clear in the MarkableIterator interface 2. A Counter for the number of times values are iterated over would be nice to have You probably can improve the implementation of how you write the firstkeybytes/firstvaluebytes by passing the Serializer the stream corresponding to the BackupStore as opposed to making a DataOutputBuffer copy of the bytes. Granted this is happening only for the first key/value bytes after a mark is called. But maybe it makes sense to keep the implementation tight if it doesn't mess up the code a lot. 3. Remove values.clear() from the ReduceValuesIterator iteration 4. Task.ValuesIterator.readNextValue should do "nextValueBytes.getLength() - nextValueBytes.getPosition()" to get the length? 5. The size for the MemoryCache in BackupStore should probably be a fraction of mapred.job.reduce.input.buffer.percent.
          Hide
          Jothi Padmanabhan added a comment -

          New patch incorporating review comments. Made some additional changes based on offline comments from Devaraj and Owen. Some points:

          1. Functionality is supported only for the new API
          2. User creates a MarkableIterator that takes the original iterator in its constructor. See TestValueIterReset.java
          3. Even though the functionality is supported only for new API, BackupStore.java is in mapred package as it access IFile and Merger.Segment which are package private
          Show
          Jothi Padmanabhan added a comment - New patch incorporating review comments. Made some additional changes based on offline comments from Devaraj and Owen. Some points: Functionality is supported only for the new API User creates a MarkableIterator that takes the original iterator in its constructor. See TestValueIterReset.java Even though the functionality is supported only for new API, BackupStore.java is in mapred package as it access IFile and Merger.Segment which are package private
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12406353/hadoop-5266-v2.patch
          against trunk revision 768376.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12406353/hadoop-5266-v2.patch against trunk revision 768376. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/244/console This message is automatically generated.
          Hide
          Devaraj Das added a comment -

          1) Minor : IFile.Writer has a new constructor. This can directly invoke an existing constructor with nulls for some arguments.
          2) Use a flag to signify EOF has been reached instead of handling EOFException explicitly in the BackUpStore.hasNext method
          3) The casting in Reducer.run is not required for the context argument.
          4) Remove the checkBaseIterator calls in MarkableIterator.java
          5) MarkableInterface can be package private
          6) ReduceContext.nextKeyValue should reuse nextKey/nextVal instead of allocating new DIB objects
          7) ReduceContext.resetBackupStore can go away. The caller of that can directly call the resetBackupStore on the iterator object.
          8) ReduceContext.createBackupStore can be moved to the iterator as a method
          9) Move the call to Segment.init from BackupStore.FileCache.createInDiskSegment to BackupStore.hasNext
          10) Is the inReset check in BackupStore.MemoryCache.reserveSpace(int) ?
          11) Improve the testcase by storing the values in an array during the first iteration, and verifying against the values obtained during the iteration after reset.

          Show
          Devaraj Das added a comment - 1) Minor : IFile.Writer has a new constructor. This can directly invoke an existing constructor with nulls for some arguments. 2) Use a flag to signify EOF has been reached instead of handling EOFException explicitly in the BackUpStore.hasNext method 3) The casting in Reducer.run is not required for the context argument. 4) Remove the checkBaseIterator calls in MarkableIterator.java 5) MarkableInterface can be package private 6) ReduceContext.nextKeyValue should reuse nextKey/nextVal instead of allocating new DIB objects 7) ReduceContext.resetBackupStore can go away. The caller of that can directly call the resetBackupStore on the iterator object. 8) ReduceContext.createBackupStore can be moved to the iterator as a method 9) Move the call to Segment.init from BackupStore.FileCache.createInDiskSegment to BackupStore.hasNext 10) Is the inReset check in BackupStore.MemoryCache.reserveSpace(int) ? 11) Improve the testcase by storing the values in an array during the first iteration, and verifying against the values obtained during the iteration after reset.
          Hide
          Jothi Padmanabhan added a comment -

          Patch incorporating all the review comments

          Show
          Jothi Padmanabhan added a comment - Patch incorporating all the review comments
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12406595/hadoop-5266-v3.patch
          against trunk revision 769174.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12406595/hadoop-5266-v3.patch against trunk revision 769174. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/255/console This message is automatically generated.
          Hide
          Jothi Padmanabhan added a comment -

          The test failure org.apache.hadoop.io.TestUTF8.testIO passes in my local box and this test is unrelated to the patch

          Show
          Jothi Padmanabhan added a comment - The test failure org.apache.hadoop.io.TestUTF8.testIO passes in my local box and this test is unrelated to the patch
          Hide
          Jothi Padmanabhan added a comment -

          This patch fixes a bug related to lazy opening of the segments from the BackupStore present in the previous patch.

          Show
          Jothi Padmanabhan added a comment - This patch fixes a bug related to lazy opening of the segments from the BackupStore present in the previous patch.
          Hide
          Jothi Padmanabhan added a comment -

          Test patch and ant test passed in my local machine

          Show
          Jothi Padmanabhan added a comment - Test patch and ant test passed in my local machine
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12406756/hadoop-5266-v4.patch
          against trunk revision 770549.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12406756/hadoop-5266-v4.patch against trunk revision 770549. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/268/console This message is automatically generated.
          Hide
          Devaraj Das added a comment -

          I just committed this. Thanks, Jothi!

          Show
          Devaraj Das added a comment - I just committed this. Thanks, Jothi!
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk #826 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/826/)
          . Adds the capability to do mark/reset of the reduce values iterator in the Context object API. Contributed by Jothi Padmanabhan.

          Show
          Hudson added a comment - Integrated in Hadoop-trunk #826 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/826/ ) . Adds the capability to do mark/reset of the reduce values iterator in the Context object API. Contributed by Jothi Padmanabhan.

            People

            • Assignee:
              Jothi Padmanabhan
              Reporter:
              Jothi Padmanabhan
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development