Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha, 3.0.0, 2.0.2-alpha
    • Fix Version/s: 2.0.3-alpha
    • Component/s: None
    • Target Version/s:
    • Release Note:
      MAPREDUCE-4807 Allow external implementations of the sort phase in a Map task

      Description

      Define interfaces and some abstract classes in the Hadoop framework to facilitate external sorter plugins both on the Map and Reduce sides.

      1. HadoopSortPlugin.pdf
        56 kB
        Mariappan Asokan
      2. HadoopSortPlugin.pdf
        61 kB
        Mariappan Asokan
      3. KeyValueIterator.java
        1 kB
        Mariappan Asokan
      4. MapOutputSorter.java
        1 kB
        Mariappan Asokan
      5. MapOutputSorterAbstract.java
        5 kB
        Mariappan Asokan
      6. mapreduce-2454.patch
        106 kB
        Mariappan Asokan
      7. mapreduce-2454.patch
        106 kB
        Mariappan Asokan
      8. mapreduce-2454.patch
        105 kB
        Mariappan Asokan
      9. mapreduce-2454.patch
        104 kB
        Mariappan Asokan
      10. mapreduce-2454.patch
        104 kB
        Mariappan Asokan
      11. mapreduce-2454.patch
        102 kB
        Mariappan Asokan
      12. mapreduce-2454.patch
        69 kB
        Mariappan Asokan
      13. mapreduce-2454.patch
        68 kB
        Mariappan Asokan
      14. mapreduce-2454.patch
        68 kB
        Mariappan Asokan
      15. mapreduce-2454.patch
        281 kB
        Mariappan Asokan
      16. mapreduce-2454.patch
        281 kB
        Mariappan Asokan
      17. mapreduce-2454.patch
        281 kB
        Mariappan Asokan
      18. mapreduce-2454.patch
        280 kB
        Mariappan Asokan
      19. mapreduce-2454.patch
        280 kB
        Mariappan Asokan
      20. mapreduce-2454.patch
        280 kB
        Mariappan Asokan
      21. mapreduce-2454.patch
        280 kB
        Mariappan Asokan
      22. mapreduce-2454.patch
        280 kB
        Mariappan Asokan
      23. mapreduce-2454.patch
        246 kB
        Mariappan Asokan
      24. mapreduce-2454.patch
        250 kB
        Mariappan Asokan
      25. mapreduce-2454.patch
        248 kB
        Mariappan Asokan
      26. mapreduce-2454.patch
        244 kB
        Mariappan Asokan
      27. mapreduce-2454-modified-code.patch
        67 kB
        Mariappan Asokan
      28. mapreduce-2454-modified-test.patch
        1 kB
        Mariappan Asokan
      29. mapreduce-2454-new-test.patch
        33 kB
        Mariappan Asokan
      30. mapreduce-2454-protection-change.patch
        6 kB
        Mariappan Asokan
      31. mr-2454-on-mr-279-build82.patch.gz
        44 kB
        Mariappan Asokan
      32. MR-2454-trunkPatchPreview.gz
        48 kB
        Mariappan Asokan
      33. ReduceInputSorter.java
        5 kB
        Mariappan Asokan

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Mariappan Asokan added a comment -

          Sorry about that Chris. Thanks for moving.

          – Asokan

          On 04/26/2011 03:01 PM, Chris Douglas (JIRA) wrote:

          [ https://issues.apache.org/jira/browse/MAPREDUCE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

          Chris Douglas moved HADOOP-7242 to MAPREDUCE-2454:
          --------------------------------------------------

          Affects Version/s: (was: 0.21.0)
          Key: MAPREDUCE-2454 (was: HADOOP-7242)
          Project: Hadoop Map/Reduce (was: Hadoop Common)

          Allow external sorter plugin for MR
          -----------------------------------

          Key: MAPREDUCE-2454
          URL: https://issues.apache.org/jira/browse/MAPREDUCE-2454
          Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Reporter: Mariappan Asokan
          Priority: Minor

          Define interfaces and some abstract classes in the Hadoop framework to facilitate external sorter plugins both on the Map and Reduce sides.


          This message is automatically generated by JIRA.
          For more information on JIRA, see: http://www.atlassian.com/software/jira

          Owen O'Malley added a comment -

          You should also look at the work in MAPREDUCE-279. Once the MapReduce library is user code there are a lot more options available.

          Mariappan Asokan added a comment -

          Hi Owen,
          Thank you very much for your suggestion. Originally, I was experimenting with
          my code on a Cloudera distribution which is based on Apache Hadoop 0.20.2. I
          added most of my code to the mapred package. We did some extensive testing with
          an external sorter plugin and found the results very encouraging.

          It is really exciting to see where Hadoop is heading for the long term. The
          contribution we are making will be useful even when all the Task-related
          classes become public and live outside the core packages.

          I am giving more details on the proposal below. Please feel free to comment.

          The idea is to bypass the framework's sorting on both the Map and Reduce sides.
          On the Map side, it is very easy. Just define a public interface extending the
          MapOutputCollector. Please see the attached file MapOutputSorter.java.

          An abstract class called MapOutputSorterAbstract (implementing MapOutputSorter)
          will be provided which acts as a conduit to invoke methods in package-protected
          classes in the mapred package. I guess once Hadoop evolves and pulls
          out Task-related classes from the core package, this abstract class may be
          unnecessary, but it is harmless. The abstract class exposes methods to send a
          progress message, to get a Counter object, to run the Combiner, to get a Map
          output file to write to, and to get a Map index file to write to. These methods
          are very thin in the sense that they use simple delegation.
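          The "thin delegation" idea described above can be sketched as follows. All class names here are toy stand-ins, not the actual Hadoop classes; the only assumption is that the abstract sorter forwards progress and counter calls to framework internals it alone can reach:

```java
// Sketch of the conduit pattern: the abstract sorter base class forwards
// framework services to plugin subclasses without exposing the
// package-private internals themselves. (Hypothetical names throughout.)

import java.util.HashMap;
import java.util.Map;

public class SorterDelegationDemo {
  /** Stand-in for the package-private Task internals. */
  static class FrameworkTask {
    private final Map<String, Long> counters = new HashMap<>();
    int progressReports = 0;
    void reportProgress() { progressReports++; }
    void incrementCounter(String name, long by) {
      counters.merge(name, by, Long::sum);
    }
    long getCounter(String name) { return counters.getOrDefault(name, 0L); }
  }

  /** Conduit class: thin delegation only, no logic of its own. */
  static abstract class MapOutputSorterBase {
    private final FrameworkTask task;
    MapOutputSorterBase(FrameworkTask task) { this.task = task; }
    protected final void sendProgress() { task.reportProgress(); }
    protected final void count(String name, long by) {
      task.incrementCounter(name, by);
    }
    public abstract void sort();
  }

  /** Example external sorter plugin built on the conduit. */
  static class NoopExternalSorter extends MapOutputSorterBase {
    NoopExternalSorter(FrameworkTask task) { super(task); }
    public void sort() {
      count("MAP_OUTPUT_RECORDS", 2);
      sendProgress();
    }
  }

  static long run() {
    FrameworkTask task = new FrameworkTask();
    new NoopExternalSorter(task).sort();
    return task.getCounter("MAP_OUTPUT_RECORDS");
  }

  public static void main(String[] args) {
    System.out.println("records counted: " + run());
  }
}
```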

          On the Reduce side, defining an external sorter interface is a bit tricky.
          Please refer to the attached file ReduceInputSorter.java for details.

          Again, there will be an abstract base class ReduceInputSorterAbstract
          (implementing ReduceInputSorter) which can be extended by users to implement
          the external sorter in the reduce phase. This abstract class provides methods to
          send a progress message, to get Counter objects, and to update shuffle client
          metrics. I had to modify MapTask.java, ReduceTask.java, Shuffle.java,
          Fetcher.java, and MapOutput.java to accommodate the external sorter.

          If I can work with an Apache committer, I will be more than happy to discuss the
          details of all the code changes. When I moved my code from the Cloudera
          distribution to Apache 0.21.0, I noticed some code refactoring that went into
          ReduceTask.java (for good). I am still merging the changes and it may take a
          week or two to test it in-house before the code can be tested formally and
          submitted for review. As far as packaging is concerned, I will try to define
          most of the classes in the mapreduce package rather than the mapred package
          (as I did in the Cloudera distribution).

          I would appreciate early feedback on this from everyone.

          Thank you very much.

          – Asokan

          Steve Loughran added a comment -
          1. All new features should be on SVN trunk, and with the plan to put MAPREDUCE-279 in there, I would strongly encourage you to get involved with the '279 project and make sure your needs are met there.
          2. It won't be enough to provide an interface with one (private) implementation behind it. Nobody is going to support that. What would be better would be to have at least one implementation of the plugin in the main codebase; ideally it would be how the normal sorter works, along with the tests to verify that everything works.
          3. The interfaces may want to extend Closeable, so they can be closed at the end of their work.
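          Steve's third point can be illustrated with a minimal sketch: if the plugin interface extends java.io.Closeable, the framework can release it deterministically with try-with-resources. SorterPlugin is a hypothetical name, not part of any patch here:

```java
// Demonstrates why a plugin interface extending Closeable is convenient:
// close() runs automatically at the end of the try block, even on failure.

import java.io.Closeable;

public class CloseableSorterDemo {
  static boolean closed = false;

  interface SorterPlugin extends Closeable {
    void sort();
  }

  static SorterPlugin newPlugin() {
    return new SorterPlugin() {
      public void sort() { /* sorting work would go here */ }
      public void close() { closed = true; } // resource cleanup hook
    };
  }

  public static void main(String[] args) throws Exception {
    try (SorterPlugin p = newPlugin()) {
      p.sort();
    } // close() runs here even if sort() throws
    System.out.println("closed=" + closed);
  }
}
```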
          Owen O'Malley added a comment -

          I suspect the right interfaces are a subset of the current mapreduce.RecordReader and RecordWriter. In particular, they would look like:

          public abstract class RecordWriter {
            public abstract void write(Object key, Object value)
                throws IOException, InterruptedException;
            public abstract void close() throws IOException, InterruptedException;
          }

          public abstract class RecordReader {
            public abstract boolean nextKeyValue()
                throws IOException, InterruptedException;
            public abstract Object getCurrentKey()
                throws IOException, InterruptedException;
            public abstract Object getCurrentValue()
                throws IOException, InterruptedException;
            public abstract void close() throws IOException, InterruptedException;
          }
          

          Making the current shuffle code implement these classes would take work, but be doable.
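          To make the proposed shape concrete, here is a toy RecordReader implementation backed by an in-memory array. The nested classes are simplified stand-ins, not the real org.apache.hadoop.mapreduce types; a real sorter would stream merged runs instead:

```java
// A minimal RecordReader-shaped class and one concrete implementation that
// iterates over fixed key/value pairs, showing how a consumer drains it.

import java.io.IOException;

public class RecordReaderDemo {
  public static abstract class RecordReader {
    public abstract boolean nextKeyValue() throws IOException, InterruptedException;
    public abstract Object getCurrentKey() throws IOException, InterruptedException;
    public abstract Object getCurrentValue() throws IOException, InterruptedException;
    public abstract void close() throws IOException, InterruptedException;
  }

  /** Reads key/value pairs from a fixed array. */
  static class ArrayRecordReader extends RecordReader {
    private final Object[][] pairs;
    private int pos = -1;
    ArrayRecordReader(Object[][] pairs) { this.pairs = pairs; }
    public boolean nextKeyValue() { return ++pos < pairs.length; }
    public Object getCurrentKey() { return pairs[pos][0]; }
    public Object getCurrentValue() { return pairs[pos][1]; }
    public void close() { }
  }

  static String drain() throws Exception {
    RecordReader r = new ArrayRecordReader(new Object[][] {{"a", 1}, {"b", 2}});
    StringBuilder sb = new StringBuilder();
    while (r.nextKeyValue()) {
      sb.append(r.getCurrentKey()).append('=').append(r.getCurrentValue()).append(';');
    }
    r.close();
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(drain());
  }
}
```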

          I'll also take the chance to suggest Java's ServiceLoader library as the right way to configure the plugin.

          Mariappan Asokan added a comment -

          Hi Steve,
          Thanks for your comments. I will definitely take a look at the MAPREDUCE-279 branch. I will be providing a public implementation of the interface. It will use the GNU sort command as the external sorter. Sure, I will extend my interfaces from Closeable.

          – Asokan

          Mariappan Asokan added a comment -

          Hi Owen,
          Thank you for your comments. Here are my thoughts.

          Map

          On the Map side, the external sorter would also need the partition number, not
          just the key and value. I am not sure how RecordWriter can be used. In the
          current MapTask, the sorting starts in MapOutputBuffer, which implements
          MapOutputCollector. I thought it natural for an external sorter to extend the
          MapOutputCollector interface as well. Perhaps, following Steve's suggestion, we
          can rewrite MapOutputCollector as:

          public interface MapOutputCollector<K, V> extends Closeable {
            public void collect(K key, V value, int partition)
              throws IOException, InterruptedException;
            public void flush() throws IOException, InterruptedException;
          }

          At present, "extends Closeable" is missing.
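          A minimal in-memory implementation of this rewritten interface might look like the sketch below. The types are toy stand-ins; Hadoop's real collector sorts and spills runs to disk rather than holding everything in lists:

```java
// Toy MapOutputCollector: collect(key, value, partition) buffers records per
// partition; flush() is where an external sorter would sort and spill.

import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CollectorDemo {
  public interface MapOutputCollector<K, V> extends Closeable {
    void collect(K key, V value, int partition)
        throws IOException, InterruptedException;
    void flush() throws IOException, InterruptedException;
  }

  static class InMemoryCollector implements MapOutputCollector<String, Integer> {
    final List<List<String>> partitions = new ArrayList<>();
    InMemoryCollector(int numPartitions) {
      for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
    }
    public void collect(String key, Integer value, int partition) {
      partitions.get(partition).add(key + ":" + value);
    }
    public void flush() { /* an external sorter would sort and spill here */ }
    public void close() { }
  }

  static int run() throws Exception {
    try (InMemoryCollector c = new InMemoryCollector(2)) {
      c.collect("x", 1, 0);
      c.collect("y", 2, 1);
      c.collect("z", 3, 1);
      c.flush();
      return c.partitions.get(1).size();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("partition 1 records: " + run());
  }
}
```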

          A digression
          ------------
          If you treat the framework's sorter as a black box, it accepts a set of key and value pairs but produces raw key and value pairs. There is an asymmetry.

          An external sorter may not produce a RawKeyValueIterator (due to its own serialization mechanism; for example, if records are piped to GNU sort, the key and value may be serialized with a TAB separator between them). If an
          external sorter would like to use the CombinerRunner classes defined in Task.java, it cannot do so without incurring an additional data move. I was looking for an iterator that would return simple keys and values, and could not find any that is efficient. RecordReader looks appropriate functionally, but it is not efficient when passed to ValuesIterator (defined in Task.java), which gets used as part of running a Combiner: the key and values returned from RecordReader would have to be copied. Any such data move will affect performance, especially when dealing with huge volumes of data.

          I had to come up with a simple key, value iterator as below:

          public interface KeyValueIterator<K, V> extends Closeable {
            /**
             * Get the current key.
             * @param key where the current key should be stored. If this is null, a new
             * instance will be created and returned.
             * @return current key
             */
            K getCurrentKey(K key) throws IOException, InterruptedException;

            /**
             * Get the current value.
             * @param value where the current value should be stored. If this is null, a
             * new instance will be created and returned.
             * @return current value.
             */
            V getCurrentValue(V value) throws IOException, InterruptedException;

            /**
             * Set up to get the current key and value (for getKey() and getValue()).
             * @return <code>true</code> if there exists a key/value, <code>false</code>
             * otherwise.
             */
            boolean nextKeyValue() throws IOException, InterruptedException;

            /**
             * Get the Progress object; this has a float (0.0 - 1.0) indicating the bytes
             * processed by the iterator so far.
             * @return progress object.
             */
            Progress getProgress();
          }

          I was able to wrap the framework's RawKeyValueIterator inside an implementation of KeyValueIterator without an additional data move. This makes sure that anything outside the sorter sees only the simple key/value iterator. The serialized representation stays internal to the sorter black box. The external sorter is also happy, as it does not incur any extra data move.
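          The wrapping described here can be sketched with simplified stand-in types: the adapter deserializes only when the caller asks for the current key or value, so no extra copy of the buffered data is made up front.

```java
// Adapter from a raw (serialized) iterator to a simple key/value view.
// RawIterator and KeyValueView are toy stand-ins for the framework's
// RawKeyValueIterator and the proposed KeyValueIterator.

import java.nio.charset.StandardCharsets;

public class RawToKeyValueDemo {
  /** Stand-in for the framework's RawKeyValueIterator. */
  static class RawIterator {
    private final byte[][][] records;
    private int pos = -1;
    RawIterator(byte[][][] records) { this.records = records; }
    boolean next() { return ++pos < records.length; }
    byte[] rawKey() { return records[pos][0]; }
    byte[] rawValue() { return records[pos][1]; }
  }

  /** Adapter: exposes deserialized keys/values over the raw iterator. */
  static class KeyValueView {
    private final RawIterator raw;
    KeyValueView(RawIterator raw) { this.raw = raw; }
    boolean nextKeyValue() { return raw.next(); }
    // Deserialize lazily from the raw bytes the sorter already holds.
    String getCurrentKey() { return new String(raw.rawKey(), StandardCharsets.UTF_8); }
    String getCurrentValue() { return new String(raw.rawValue(), StandardCharsets.UTF_8); }
  }

  static String firstPair() {
    byte[][][] data = { { "k1".getBytes(StandardCharsets.UTF_8),
                          "v1".getBytes(StandardCharsets.UTF_8) } };
    KeyValueView view = new KeyValueView(new RawIterator(data));
    view.nextKeyValue();
    return view.getCurrentKey() + "=" + view.getCurrentValue();
  }

  public static void main(String[] args) {
    System.out.println(firstPair());
  }
}
```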

          The abstract base class and MAPREDUCE-279
          -----------------------------------------
          As I mentioned in my previous post, I created an abstract base class called MapOutputSorterAbstract (I am attaching the source to this Jira) in order to access package-protected class methods. I would appreciate it if developers
          familiar with MAPREDUCE-279 could take a look at the class and comment on whether it can live completely outside the framework. In MapTask.java, I needed to change the access of APPROX_HEADER_LENGTH from private to package level.

          My specific questions are:

          Will TaskReporter, Counter, and MapOutputFile be accessible as public classes, and how can they be passed to the external sorter from MapTask.java?

          Will CombinerRunner as defined in Task.java be available? (I had to change the access to public.)

          The classes SpillRecord and IndexRecord should also be made public. Since IndexRecord is currently an inner class of SpillRecord, I created a separate file IndexRecord.java and moved the code there.

          Reduce
          ------
          On the Reduce side, I was trying to come up with an interface that can be implemented by both the framework and an external sorter. It was not easy to decouple shuffling from the Merger, since the shuffle drives the merge, not
          the other way around. Since I wanted to reuse the framework's shuffle code, I ended up using a few ugly if's so that data is shuffled either to the framework's Merger or to an external sorter.

          If the shuffle code can somehow be invoked from outside the core packages using public interfaces, the external sorter on the Reduce side can just implement a simple key, value iterator. I think this might require some inversion of control and code rewrite in some sensitive areas.

          On ServiceLoader:
          -----------------
          I thought of taking a simple approach: users can configure a job with the names of the external sorter classes in configuration parameters like mapred.map.externalsort.class and mapred.reduce.externalsort.class, and a
          simple Java class loader can load the class. This is very similar to configuring a Mapper class, for example. Am I missing something? Is there a strong reason to use ServiceLoader?
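          The class-loader approach described above can be sketched with plain reflection. The property key below mirrors the proposed mapred.map.externalsort.class name, and java.util.ArrayList stands in for a real sorter class:

```java
// Load a plugin class named in a configuration map via reflection, falling
// back to null (i.e. the framework's built-in sorter) when unset.

import java.util.HashMap;
import java.util.Map;

public class PluginLoadDemo {
  static Object loadPlugin(Map<String, String> conf, String key) throws Exception {
    String className = conf.get(key);
    if (className == null) {
      return null; // fall back to the framework's built-in sorter
    }
    return Class.forName(className).getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    Map<String, String> conf = new HashMap<>();
    conf.put("mapred.map.externalsort.class", "java.util.ArrayList");
    Object plugin = loadPlugin(conf, "mapred.map.externalsort.class");
    System.out.println("loaded: " + plugin.getClass().getName());
  }
}
```

ServiceLoader, by contrast, discovers implementations from META-INF/services resource files on the classpath; the reflection approach keeps the choice explicit in the job configuration, which matches how a Mapper class is configured.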

          Please give me your feedback.

          If developers do not get the full picture of what I was playing with, I can try to make my changes locally on top of MAPREDUCE-279 branch and post a patch file.

          Thanks everyone for your patience.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Owen, Thank you for your comments. Here are my thoughts. Map — On the Map side, the external sorter would also need the partition number, notjust the key and value. I am not sure how RecordWriter can be used. In the current MapTask, the sorting starts in MapOutputBuffer which implements MapOutputCollector. I thought it is natural for an external sorter to extendMapOutputCollector interface as well. Perhaps, following Steve's suggestion we can rewrite MapOutputCollector as: public interface MapOutputCollector<K, V> extends Closeable { public void collect(K key, V value, int partition) throws IOException, InterruptedException; public void flush() throws IOException, InterruptedException; } At present, "extends Closeable" is missing. A digression ------------ If you treat the framework's sorter as a black box, it accepts a set of key and value pairs but produces raw key and value pairs. There is an asymmetry. An external sorter may not produce a RawKeyValueIterator(due to its own serialization mechanism - for example if records are piped to GNU sort, the key and value may be serialized with a TAB separator between them.) If an external sorter would like to use CombinerRunner classes defined in Task.java, it cannot do so without incurring an additional data move. I was looking for an iterator that will return simple key and values. I could not find any that is efficient. The RecordReader looks appropriate functionally, but is not efficient when passed to ValuesIterator(defined in Task.java) which gets used as part of running a Combiner. The key and values returned from RecordReader will have to be copied. Any kind of such data move will affect the performance especially when dealing with huge volume of data. I had to come up with a simple key, value iterator as below: public interface KeyValueIterator<K, V> extends Closeable { /** * Get the current key. * * @param key where the current key should be stored. 
If this is null, a new * instance will be created and returned. * @param key current key */ K getCurrentKey(K key) throws IOException, InterruptedException; /** * Get the current value. * * @param value where the current value should be stored. If this is null, a * new instance will be created and returned. * @return value current value. */ V getCurrentValue(V value) throws IOException, InterruptedException; /** * Set up to get the current key and value (for getKey() and getValue()). * * @return <code>true</code> if there exists a key/value, <code>false</code> * otherwise. */ boolean nextKeyValue() throws IOException, InterruptedException; /** * Get the Progress object; this has a float (0.0 - 1.0) indicating the bytes * processed by the iterator so far. * @return progress object. */ Progress getProgress(); } I was able to wrap the framework's RawKeyValueIterator inside an implementation of KeyValueIterator without additional data move. This makes sure that anything outside the sorter sees only the simple key, value iterator. The serialized representation stays internal to the sorter black box. The external sorter is also happy as it does not incur any extra data move The abstract base class and MAPREDUCE-279 ----------------------------------------- As I mentioned in my previous post, I created an abstract base class called MapOutputSorterAbstract(I am attaching the source to Jira-2454) in order to access package protected class methods. I would appreciate if developers familiar with MAPREDUCE-279 can take a look at the class and comment on whether the class can live completely outside the framework. In MapTask.java, I needed to change the access of APPROX_HEADER_LENGTH to package public from private. My specific questions are: If TaskReporter, Counter, MapOutputFile be accessible as public classes, how can they be passed to external sorter from MapTask.java? Will CombinerRunner as defined in Task.java be available?(I had to change the access to public.) 
The classes SpillRecord and IndexRecord should also be made public. Since IndexRecord is not an inner class of SpillRecord, I created another file IndexRecord.java and moved the code there. Reduce ------ On the Reduce side, I was trying to come up with an interface that can be implemented by both the framework as well as an external sorter. It was not easy to decouple shuffling and Merger since the shuffle is driving the Merge not the other way around. Since I wanted to reuse the framework's shuffle code, I ended up using a few ugly if's so that data is shuffled either to the framework's Merger or to an external sorter. If the shuffle code can somehow be invoked from outside the core packages using public interfaces, the external sorter on the Reduce side can just implement a simple key, value iterator. I think this might require some inversion of control and code rewrite in some sensitive areas. On ServiceLoader: ----------------- I thought of taking a simple approach: Users can configure a job with the name of the external sorter classes in configuration parameters like mapred.map.externalsort.class and mapred.reduce.externalsort.class and use simple Java class loader to load the class. This is very similar to configuring a Mapper class for example. Am my missing something? Is there a strong reason to use ServiceLoader? Please give me your feedback. If developers do not get the full picture of what I was playing with, I can try to make my changes locally on top of MAPREDUCE-279 branch and post a patch file. Thanks everyone for your patience. – Asokan
          Hide
          Mariappan Asokan added a comment -

          Sorry, the code was not formatted properly. I have uploaded KeyValueIterator.java as well.

          Mariappan Asokan added a comment -

          I thought more about the implementation. Here are the steps I came up with.
          I can create one Jira per step. If there is any objection from anyone, I would
          like to hear about it before I jump in. In the following, any reference to
          'framework code' means the current Hadoop Map/Reduce code taken at the
          MAPREDUCE-279 branch.

          1. Modify the framework code so that RawKeyValueIterator is visible only within
            the sort/merge related code. All others will see the new KeyValueIterator as
            defined below:
            KeyValueIterator.java
            public interface KeyValueIterator<K, V> extends Closeable {
              /**
               * Get the current key.
               * @param key where the current key should be stored.  If this is null, a new
               * instance will be created and returned.
               * @return current key
               * @exception IOException in case of error.
               * @exception InterruptedException in case of interruption.
               */
              K getCurrentKey(K key) throws IOException, InterruptedException;
              
              /**
               * Get the current key.
               * @return current key
               * @exception IOException in case of error.
               * @exception InterruptedException in case of interruption.
               */
              K getCurrentKey() throws IOException, InterruptedException;
              
              /**
               * Get the current value.
               * @param value where the current value should be stored.  If this is null, a
               * new instance will be created and returned.
               * @return current value.
               * @exception IOException in case of error.
               * @exception InterruptedException in case of interruption.
               */
              V getCurrentValue(V value) throws IOException, InterruptedException;
              
              /**
               * Get the current value.
               * @return current value.
               * @exception IOException in case of error.
               * @exception InterruptedException in case of interruption.
               */
              V getCurrentValue() throws IOException, InterruptedException;
              
              /**
               * Set up to get the current key and value (for getKey() and getValue()).
               * @return <code>true</code> if there exists a key/value, <code>false</code>
               * otherwise. 
               * @throws IOException in case of error.
               * @exception InterruptedException in case of interruption.
               */
              boolean nextKeyValue() throws IOException, InterruptedException;
              
              /**
               * Get the Progress object; this has a float (0.0 - 1.0) indicating the bytes
               * processed by the iterator so far.
               * @return progress object.
               */
              Progress getProgress();
            }
            

            This will enable any external sorter implementation to reuse existing code in
            Task.java to run the combiner and the code in ReduceTask.java to run the
            reducer.

          2. Modify the framework code so that an external sorter can be plugged on the
            Map side.
          3. Modify the shuffle code on the Reduce side so that a shuffle can be started
            by any code outside MR (perhaps in a separate thread). A callback interface
            will be passed to shuffle. Tentatively, the callback will look like this and
            can be refined.
            ShuffleCallback.java
            public interface ShuffleCallback<K, V> extends Closeable {
              /**
               * To reserve space for the specified mapper output.
               * @param mapId mapper id.
               * @param requestedSize number of bytes in space to be reserved.
               * @param fetcher id of fetcher that will fetch the map output.
               * @exception IOException in case of error.
               */
              public MapOutput<K,V> reserve(TaskAttemptID mapId, long requestedSize,
                                            int fetcher)
                throws IOException;
            
              /**
               * To shuffle the data from local mappers.
               * @param localMapFiles array of map output files to be sorted.
               * @return total number of bytes read from the mapper outputs.
               * @exception IOException if there is any IO error while reading.
               * @exception InterruptedException if there is an interruption.
               * @exception ReduceInputSorterException any other exception that occurs while
               * sorting.
               */
              public long shuffle(Path[] localMapFiles)
                throws IOException, InterruptedException, ReduceInputSorterException;
            
              /**
               * To shuffle the raw data coming from a non-local mapper.  Multiple threads
               * can call this method with input from different mappers.
               * @param inputFromMapper the raw input stream from the mapper. If map output
               * is compressed, the sorter is responsible for decompressing.
               * @param mapTaskId map task id corresponding to the stream.
               * @param compressedLength The size of the compressed data.
               * @return number of bytes read from the mapper stream.
               * @exception IOException if there is any IO error while reading.
               * @exception InterruptedException if there is an interruption.
               * @exception ReduceInputSorterException any other exception that occurs in
               * the sorter.
               */
              public long shuffle(InputStream inputFromMapper, String mapTaskId,
                                  long compressedLength)
                throws IOException, InterruptedException, ReduceInputSorterException;
            
              /**
               * To commit shuffled data from a non-local mapper.  Usually, this method is
               * called right after shuffle() from the same thread once it is
               * decided to commit.
               * @param mapTaskId map task id corresponding to the shuffled data.
               * @exception IOException if there is any IO error while committing.
               * @exception InterruptedException in case of interruption.
               * @exception ReduceInputSorterException any other exception that occurs in
               * the sorter.
               */
              public void commit(String mapTaskId)
                throws IOException, InterruptedException, ReduceInputSorterException;
            
              /**
               * To discard shuffled data from a non-local mapper. Usually, this method is
               * called right after shuffle() from the same thread once it is
               * decided to discard.
               * @param mapTaskId map task id corresponding to the shuffled data.
               * @exception IOException if there is any IO error while discarding.
               * @exception InterruptedException in case of interruption.
               * @exception ReduceInputSorterException any other exception that occurs in
               * the sorter.
               */
              public void discard(String mapTaskId)
                throws IOException, InterruptedException, ReduceInputSorterException;
            
              /**
               * To indicate end of input from all non-local mappers.  This should be called
               * after all non-local mapper outputs are committed.
               * @exception IOException if there is any IO error.
               * @exception InterruptedException if there is an interruption.
               * @exception ReduceInputSorterException any other exception that occurs in
               * the sorter.
               */
              public void close()
                throws IOException, InterruptedException, ReduceInputSorterException;
            }
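The call sequence a fetcher thread would drive through this callback (shuffle, then commit or discard, then a single close) can be sketched with a hypothetical stand-in in which the Hadoop-specific types (MapOutput, TaskAttemptID, Path) are replaced by plain strings and numbers:

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ShuffleCallback that only records the call
// sequence; it is not the real interface, just an illustration of its use.
class RecordingShuffleCallback implements Closeable {
  final List<String> calls = new ArrayList<>();

  // shuffle() consumes one mapper's stream; the return value stands in for
  // the number of bytes read.
  long shuffle(String mapTaskId, long compressedLength) {
    calls.add("shuffle:" + mapTaskId);
    return compressedLength;
  }

  // A successful shuffle is followed by commit(), a failed one by discard().
  void commit(String mapTaskId) { calls.add("commit:" + mapTaskId); }
  void discard(String mapTaskId) { calls.add("discard:" + mapTaskId); }

  // Called once, after all non-local map outputs have been committed.
  public void close() { calls.add("close"); }
}
```

Multiple fetcher threads could interleave shuffle/commit pairs for different map tasks; only close() marks end of input.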
            
          4. Modify ReduceTask.java so that it will invoke the framework's merge or the
            external sorter.
          5. The external sorter interface on the Map side will look like:
            MapOutputSorter.java
            public interface MapOutputSorter<K, V> 
              extends MapOutputCollector<K, V> {
              /**
               * To initialize the sorter.
               * @param job job configuration.
               * @param mapTask map task invoking this sorter.
               * @param inputSplitSize size of input split processed by this sorter.
               * @param mapOutputFile map output file.
               * @param reporter reporter to report sorter progress.
               * @exception IOException if there is any error during initialization.
               * @exception ClassNotFoundException if a class to be loaded is not found.
               * @exception UnsupportedOperationException thrown by the sorter if it
               * cannot support certain options in the job.  For example, a sorter may
               * support only a certain subset of key types.  The default sorter in the
               * framework will be used as a fallback when this exception is thrown.
               */
              public void initialize(JobConf job, MapTask mapTask, long inputSplitSize,
                                     MapOutputFile mapOutputFile, TaskReporter reporter)
                throws IOException, ClassNotFoundException, UnsupportedOperationException;
            }
            

            The MapOutputCollector interface defined in MapTask.java will be made public
            and will look like this:

            MapOutputCollector.java
            public interface MapOutputCollector<K, V> extends Closeable {
              public void collect(K key, V value, int partition
                                  ) throws IOException, InterruptedException;
              public void flush() throws IOException, InterruptedException, 
                                         ClassNotFoundException;
              public void close() throws IOException, InterruptedException;
            }
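A hypothetical in-memory sketch of the collect/flush/close life cycle may help: collect() buffers (partition, key, value) triples and flush() sorts them by partition and then by key, the ordering the framework's spill code also produces. The class name and the list-of-arrays representation are invented for this example; a real plugin would implement the interfaces above against the Hadoop types.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sorter sketch: not a real MapOutputSorter implementation,
// just the collect -> flush -> close shape with an in-memory sort.
class InMemorySorterSketch<K extends Comparable<K>, V> implements Closeable {
  private final List<Object[]> buffer = new ArrayList<>();
  private List<Object[]> sorted;

  // Buffer one record; records carry their partition number.
  public void collect(K key, V value, int partition) {
    buffer.add(new Object[] { partition, key, value });
  }

  // Sort by partition first, then by key within each partition.
  @SuppressWarnings("unchecked")
  public void flush() {
    sorted = new ArrayList<>(buffer);
    sorted.sort(Comparator
        .<Object[]>comparingInt(r -> (Integer) r[0])
        .thenComparing(r -> (K) r[1]));
  }

  public List<Object[]> sortedRecords() { return sorted; }
  public void close() { buffer.clear(); }
}
```

A real implementation would write the sorted run to the map output file instead of keeping it in memory.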
            

            On the Reduce side, the external sorter interface will look like:

            ReduceInputSorter.java
            public interface ReduceInputSorter<K, V> extends KeyValueIterator<K, V> {
              /**
               * Initialize the sorter.
               * @param job job configuration.
               * @param reduceTask reduce task invoking this sorter.
               * @param reporter reporter to report sorter progress.
               * @exception IOException if there is any error during initialization.
               * @exception ClassNotFoundException if a class to be loaded is not found.
               * @exception UnsupportedOperationException thrown by the sorter if it
               * cannot support certain options in the job.  For example, a sorter may
               * support only a certain subset of key types.  The default sorter in the
               * framework will be used as a fallback when this exception is thrown.
               */
              public void initialize(JobConf job, ReduceTask reduceTask,
                                     TaskReporter reporter)
                throws IOException, ClassNotFoundException, UnsupportedOperationException;
            }
            
          6. Some abstract base classes of the above two may be provided in the framework
            to facilitate external sorter implementations.
          7. Provide a proof-of-concept implementation of an external sorter on both the
            Map and Reduce sides, using the GNU sort command as the external sorter.
            *None of the changes mentioned above should result in any performance
            degradation of the framework code when no external sorter is plugged in.*
          Hide
          Steve Loughran added a comment -

          Why shouldn't a sorter API also work with the existing sorter? If it does, then the API and plugin code etc. will get tested on every build, whereas a "proof-of-concept implementation of an external sorter on both the
          Map and Reduce sides using the GNU sort command" isn't going to get tested very often, and the GPL license prevents the ASF from redistributing that sorter, ignoring all binary library issues.

          Hide
          Steve Loughran added a comment -

          I'm also worried about the exception signatures of the interface. One of the hardest things to do in any Java plugin API is getting the list of possible exceptions thrown right; if not, you end up wrapping everything up.

          Presumably the list of exceptions thrown is derived from your (private) implementation. For example, the initialize() methods throw {{IOException, ClassNotFoundException, UnsupportedOperationException}}.

          A ClassNotFoundException implies the implementation is possibly playing with classloaders, which is a dangerous game, and one that most Java developers, myself included, shouldn't be playing. At the very least the exception list should include:

          • ClassCastException: the class loads but is the wrong type
          • InstantiationException: you can't instantiate the class as its constructor isn't there
          • NoClassDefFoundError: can't find the class though it used to be there
          • LinkageError etc.

          The other signatures may work, but the current initialize code is clearly biased towards one single implementation. We'd need more implementations, and some test cases that simulate other failures, to make sure whatever changes go into the MR engine actually handle them, presumably by catching them and failing the job. I don't see any point in trying to fall back to another sorter, as that would just hide a problem.

          Hide
          Steve Loughran added a comment -

          Oh, and the security exceptions that get raised if the class does have a constructor with the right signature but it's private need to be caught and turned into something else.

          Hide
          Mariappan Asokan added a comment -

          Hi Steve,
          Thank you very much for your comments. I will try to make the sorting done on the Map and Reduce sides pluggable. The default implementation will be whatever is available in the framework. It is easy to separate the sorting process on the Map side (currently all the code is in the class MapOutputBuffer, which lives in MapTask.java). It is very hard to separate the merge on the Reduce side because of the way it is coded. I am working to separate that as well.

          Regarding the GNU sort plugin, I am making the external sort command name configurable. It can be the POSIX sort command as well. Since most Hadoop installations are Linux based, GNU sort is available as the POSIX sort implementation. Other UNIX installations can use the POSIX sort command as an external sorter. There is no GPL issue. Perhaps I can remove the word GNU and just call it UNIX.

          Regarding class loader related exceptions: I will look at the framework's code, see what it does when it loads a Mapper or Reducer class, and follow the same approach since the scenario is very similar. All the issues you have raised w.r.t. class loading are applicable there as well.

          An explanation of UnsupportedOperationException: if the external sorter uses a UNIX command like sort, it may not be able to handle a custom key type the user has defined, since the key comparator may be written in Java. In such a case there will be a message logged in syslog and the framework's sorter will be used. I think this is fair enough. Please let me know if you think otherwise.
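The fallback behaviour described here can be sketched with a hypothetical, self-contained example (the Sorter interface, TextOnlySorter, DefaultSorter, and SorterFactory names are all invented for illustration and are not the actual Hadoop classes):

```java
// Hypothetical sketch: if the configured external sorter rejects the job's
// options with UnsupportedOperationException, fall back to the default sorter.
interface Sorter {
  void initialize(String keyType) throws UnsupportedOperationException;
}

// Stands in for an external sorter that only handles Text keys.
class TextOnlySorter implements Sorter {
  public void initialize(String keyType) {
    if (!keyType.equals("Text")) {
      throw new UnsupportedOperationException("only Text keys supported");
    }
  }
}

// Stands in for the framework's built-in sorter, which handles any key type.
class DefaultSorter implements Sorter {
  public void initialize(String keyType) {}
}

class SorterFactory {
  // Returns the external sorter if it accepts the job, else the fallback.
  static Sorter pick(Sorter external, Sorter fallback, String keyType) {
    try {
      external.initialize(keyType);
      return external;
    } catch (UnsupportedOperationException e) {
      // In the real task this would be logged to syslog first.
      fallback.initialize(keyType);
      return fallback;
    }
  }
}
```

The job still runs either way; the exception only downgrades which sorter is used.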

          When I am done with the implementation (on top of MAPREDUCE-279) and testing, I will post a patch file for review. Would you be interested in working with me as a committer?

          Thank you.

          Hide
          Owen O'Malley added a comment - - edited

          Actually, I think I made a mistake in pushing the objects into the interface, especially since I plan to change the serialization layer. I think it would be better to do:

          RawRecordWriter
          package org.apache.hadoop.mapreduce.task;
          
          public abstract class RawRecordWriter implements Closeable {
            /**
             * Called once at start of processing
             */
            public abstract void initialize(TaskAttemptContext context
                                            ) throws IOException, InterruptedException;
          
            /**
             * Called once per record. The key and value will be copied before write returns.
             */
            public abstract void write(int partition, ByteBuffer key, ByteBuffer value
                                       ) throws IOException, InterruptedException;
          
            /**
             * Called once at task finish or failure.
             */
            public abstract void close() throws IOException;
          }
          

          For the Reduce side, we could just use the RawKeyValueIterator, but I suspect we'll be in
          better shape if we do something similar:

          RawRecordReader.java
          package org.apache.hadoop.mapreduce.task;
          
          public abstract class RawRecordReader implements Closeable {
            /**
             * Called once at start of processing
             */
            public abstract void initialize(TaskAttemptContext context
                                            ) throws IOException, InterruptedException;
          
            /**
             * Advance to the next record. Returns false when there are no more records.
             */
            public abstract boolean next() throws IOException, InterruptedException;
          
            /**
             * Provides the ByteBuffer with the key. The ByteBuffer may be reused after each call to
             * next.
             */
            public abstract ByteBuffer getKey() throws IOException, InterruptedException;
          
            /**
             * Provides the ByteBuffer with the value. The ByteBuffer may be reused after each call to
             * next.
             */
            public abstract ByteBuffer getValue() throws IOException, InterruptedException;
          
            /**
             * Called once at task finish or failure.
             */
            public abstract void close() throws IOException;  
          }
          

          This has a couple of advantages:

          • The plugin gets the TaskAttemptContext and the configuration.
          • Serialization stays part of MapReduce instead of the sort library.
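          The "copied before write returns" contract of the proposed RawRecordWriter can be illustrated with a hypothetical, simplified writer (the class name and the in-memory list are invented; a real plugin would hand the bytes to its sort engine):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the copy contract: the writer takes its own copy
// of the key bytes, so the caller is free to reuse the ByteBuffer for the
// next record immediately after write() returns.
class CopyingRawWriter {
  final List<byte[]> keys = new ArrayList<>();

  void write(int partition, ByteBuffer key, ByteBuffer value) {
    byte[] copy = new byte[key.remaining()];
    key.get(copy); // drains the caller's buffer, but the copy is ours
    keys.add(copy);
  }
}
```

          Without the copy, a caller reusing one ByteBuffer across records would silently corrupt every record the writer had buffered.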
          Hide
          Owen O'Malley added a comment -

          I should also comment that the initialize method would allow the plugin to start any additional
          services that it needs.

          MapOutputCollector's flush is a method that should be controlled by the sorter and not the framework.

          Hide
          Steve Loughran added a comment -

          > Would you be interested to work with me as a committer?

          I am so overloaded with non-Hadoop work that I don't get the time to get my own patches up to date with MAPREDUCE-279, let alone work on other things. Sorry.

          I recommend you get involved on the mapreduce-dev list and understand what's being discussed (if you subscribe to the mapreduce-issues list you'll get all JIRA changes), so you get involved in the bigger picture of where things are going and get to know the people who know their way round the codebase better than me.

          Hide
          Mariappan Asokan added a comment -

          Hi Owen,
          Thanks for your comments. I like your suggestion on the signature of the initialize() method and also not having a flush(). However, I prefer to pass the key and value as objects instead of as serialized bytes, for the following reasons:

          • It is easier and more efficient when an external program (like the UNIX sort command) is invoked as a sorter. The key and value types will be Text. The bytes in the Text can be grabbed and passed to the program with a TAB between them. There is no need to deserialize data passed in a byte buffer. This is similar to what happens with Hadoop Streaming when, for example, a Mapper is implemented by an external program. Also, on the Map side the output of the mapper is key and value objects, which can be passed directly to the sorter, so there is no extra serialization/deserialization. A similar argument applies when the output of the sorter is read on the Reduce side using a RecordReader.
          • The framework's serialization is in no way affected. It is free to replace the serialization layer. The external sorter can store the sorted output as simple UNIX text records in the final map output file since it will deal with the shuffled data on the Reduce side.
          • For the RecordReader, I think it is better to change the signatures of getKey() and getValue() as below:
            RecordReader
            Object getKey(Object key)     // If key is null, it will be allocated first.
            Object getValue(Object value) // If value is null, it will be allocated first.
            

            The reasons for these signatures are:

            • The RecordReader will be used for running the Combiner and Reducer. This may involve saving the last seen key. If the caller passes the key object, it can just save the object handle rather than the entire object, since it owns the object. If the callee returns its own object, it is ephemeral, so the caller has to copy it, which results in extra copying.
            • Creating an adapter to return key and value objects from their serialized counterparts (that is, from RawKeyValueIterator) will not result in any extra data copying, so the performance of the framework's sorter will not degrade.
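          The object-reuse argument above can be illustrated with a small, self-contained sketch. ReusingReader and its StringBuilder keys are illustrative stand-ins, not the proposed Hadoop API:

```java
import java.util.Iterator;

// Illustrative sketch of the getKey(Object)/getValue(Object) reuse pattern:
// the caller supplies an object to fill, so a caller that needs to retain the
// last-seen key can keep the handle without an extra copy.
class ReusingReader {
    private final Iterator<String> source;

    ReusingReader(Iterator<String> source) {
        this.source = source;
    }

    /** If key is null, a new holder is allocated; otherwise it is reused. */
    StringBuilder getKey(StringBuilder key) {
        if (key == null) {
            key = new StringBuilder();
        }
        key.setLength(0);            // reuse the caller's buffer in place
        key.append(source.next());
        return key;
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        ReusingReader reader = new ReusingReader(
            java.util.Arrays.asList("a", "b").iterator());
        StringBuilder k = reader.getKey(null);   // allocated by the callee
        StringBuilder k2 = reader.getKey(k);     // same object, new contents
        System.out.println(k == k2);             // true: no extra copy made
    }
}
```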

          Owen, do you have any suggestion on a committer with whom I can work on this?
          Thanks.

          Owen O'Malley added a comment -

          The map output key and value types are controlled by the application, not the framework. A plugin that can only sort Text objects isn't general purpose enough. Even streaming created a lot of trouble for the users by requiring UTF-8 encoding of the data.

          The only acceptable solution would be to define this API and refactor the current code into a default plugin.

          I hadn't thought enough about the combiner. It requires an inversion of control since the start of the combiner happens based on the spill.

          SortPlugin
          package org.apache.hadoop.mapreduce.task;
          
          public abstract class SortPlugin {
          
            public interface CombinerCallback {
              /** Called once for each partition of the map output */
              void runCombiner(RawRecordReader reader,
                               RawRecordWriter writer
                              ) throws IOException, InterruptedException;
            }
          
            /** Called once in map task for collector to gather
                output coming from map. */
            public abstract RawRecordWriter createRawRecordWriter()
              throws IOException, InterruptedException;
          
            /** Called once in the map task, if there is a combiner. */
            public abstract void registerCombinerCallback(CombinerCallback callback)
              throws IOException, InterruptedException;
          
            /** Called once in the reduce task for iterator to provide
                input to the reduce. */ 
            public abstract RawRecordReader createRawRecordReader() 
              throws IOException, InterruptedException;
          }
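          A minimal sketch of the inversion of control described above: the plugin holds the registered CombinerCallback and invokes it when its own spill logic decides to run a combine pass. RawRecordReader and RawRecordWriter are empty stand-ins here (their real signatures are not shown in this thread), and ExamplePlugin and onSpill() are hypothetical:

```java
// Stand-in marker interfaces; the real raw reader/writer APIs are not shown
// in this discussion.
interface RawRecordReader {}
interface RawRecordWriter {}

abstract class SortPlugin {
    public interface CombinerCallback {
        void runCombiner(RawRecordReader reader, RawRecordWriter writer)
            throws Exception;
    }
    public abstract RawRecordWriter createRawRecordWriter() throws Exception;
    public abstract void registerCombinerCallback(CombinerCallback callback)
        throws Exception;
    public abstract RawRecordReader createRawRecordReader() throws Exception;
}

// Hypothetical plugin: the framework registers the combiner once, and the
// plugin calls back into it whenever a spill occurs.
class ExamplePlugin extends SortPlugin {
    private CombinerCallback combiner;

    @Override
    public RawRecordWriter createRawRecordWriter() {
        return new RawRecordWriter() {};
    }

    @Override
    public void registerCombinerCallback(CombinerCallback callback) {
        this.combiner = callback;
    }

    @Override
    public RawRecordReader createRawRecordReader() {
        return new RawRecordReader() {};
    }

    /** Called by the plugin itself when a spill triggers a combine pass. */
    void onSpill() throws Exception {
        if (combiner != null) {
            combiner.runCombiner(createRawRecordReader(),
                                 createRawRecordWriter());
        }
    }
}

public class CombinerIoCDemo {
    public static void main(String[] args) throws Exception {
        ExamplePlugin plugin = new ExamplePlugin();
        final boolean[] ran = {false};
        plugin.registerCombinerCallback((reader, writer) -> ran[0] = true);
        plugin.onSpill();
        System.out.println("combiner ran: " + ran[0]);
    }
}
```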
          
          Mariappan Asokan added a comment -

          Thanks for all the comments I have received so far.

          I am highlighting the changes and additions made. Please look at the attached
          patch file MR-2454-trunkPatchPreview.gz for details. The patch file reflects
          the changes made on the trunk revision. I will post a patch file for MR-279
          branch later once I have the build working.

          I would like to receive feedback from all developers and especially the ones who
          worked on the files:

          Task.java, MapTask.java, and ReduceTask.java.

          We will start testing the changes once I receive the feedback.

          OVERALL

          • The interfaces SortPlugin, MapSortPlugin, and ReduceSortPlugin were added to
            facilitate sort plugin implementations.
          • The framework code was refactored to implement DefaultSortPlugin,
            DefaultMapSortPlugin, and DefaultReduceSortPlugin.
          • The shuffle code was decoupled from the framework's merge so that shuffle can
            be used by other sort plugin implementations.
          • An implementation of an external sort plugin was added under the
            contrib/sortplugin directory. It uses the Unix sort command to sort the keys.
            This is a work in progress; only the UnixMapSortPlugin is currently implemented.
          • New interfaces SortinRecordWriter and SortoutRecordReader were introduced.
            The sort plugins will provide implementations of these interfaces.
          Task.java
          • Several useful static classes that were present in this file were taken out,
            and separate files containing the corresponding public classes were created
            under hadoop/mapreduce/task.
          MapTask.java
          • The sort code in MapOutputBuffer class was taken out of this file and it will
            live in DefaultMapSortPlugin.java under hadoop/mapreduce/task/map.
          ReduceTask.java
          • Sort related code was taken out of this file and it will live in
            DefaultReduceSortPlugin.java under hadoop/mapreduce/task/reduce.
          • Helper methods to create ShuffleRunner and MergeManager instances were added.
          Shuffle.java
          MergeManager.java
          • A new interface ShuffleRunner was introduced and Shuffle will implement this
            interface.
          • A new interface ShuffleCallback was introduced and will be implemented by
            MergeManager.
          • The Shuffle class will be dealing with shuffle only. The MergeManager will no
            longer be instantiated by Shuffle.
          • An implementation of ShuffleCallback interface will be passed to the run()
            method.
          • MergeManager which implements ShuffleCallback will be instantiated by
            DefaultReduceSortPlugin.
          • Any other reduce sort plugin implementation will need to implement
            ShuffleCallback interface outside the framework.
          • Shuffle class will no longer be taking <K, V> as generic parameters.
          Fetcher.java
          ShuffleScheduler.java
          EventFetcher.java
          • The classes in these files will no longer be taking <K, V> as generic
            parameters.
          • Fetcher will receive a ShuffleCallback object as opposed to a
            MergeManager instance.
          • Fetcher will delegate the responsibility of copying shuffled data to one of
            concrete implementations of MapOutput.
          MapOutput.java
          • The code in this file was refactored so that the class MapOutput will be
            abstract with the concrete implementations OnDiskMapOutput and
            InMemoryMapOutput created from the original MapOutput.java and Fetcher.java.
            These concrete implementations will be used by MergeManager.
          • MapOutput class will no longer be burdened with carrying unrelated information
            (MEMORY, DISK, and WAIT).
          • Any other reduce sort plugin implementation will need to provide a concrete
            implementation of MapOutput class outside the framework.
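          The shuffle/merge decoupling described above amounts to a callback pattern, sketched here with invented signatures (onMapOutput and the demo classes are illustrative only; they are not the actual Hadoop interfaces):

```java
// Shuffle no longer instantiates MergeManager; the sort plugin passes a
// ShuffleCallback into run(), so any plugin can supply its own merge logic.
interface ShuffleCallback {
    void onMapOutput(String mapAttemptId, byte[] data);
}

class Shuffle {
    void run(ShuffleCallback callback) {
        // In reality this would fetch map outputs over the network; here we
        // hand a single fake output straight to the callback.
        callback.onMapOutput("attempt_m_000000_0", new byte[] {1, 2, 3});
    }
}

// The default plugin's merger simply implements the callback.
class MergeManager implements ShuffleCallback {
    int outputsReceived = 0;

    public void onMapOutput(String mapAttemptId, byte[] data) {
        outputsReceived++;
    }
}

public class ShuffleDecouplingDemo {
    public static void main(String[] args) {
        MergeManager merger = new MergeManager();
        new Shuffle().run(merger);
        System.out.println("outputs received: " + merger.outputsReceived);
    }
}
```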
          Mariappan Asokan added a comment -

          Just a reminder. I would like to receive any comments on the proposed patch to support a sort plugin so that I can continue with the Unix sort plugin implementation and the testing.
          Thanks.

          Mariappan Asokan added a comment -

          Uploaded a short document (HadoopSortPlugin.pdf) on the sort plugin. Please provide your feedback.

          Mariappan Asokan added a comment -

          Attached a patch on top of the MR-279 branch.

          Mariappan Asokan added a comment -

          Attached a patch on top of the trunk. All tests run by maven passed.

          I would appreciate it if a committer could take a look at the patch and help
          push it into the trunk.

          In the meantime, I am working on a NullSortPlugin implementation. If there is
          enough interest, I can add it to the patch later. The idea of a NullSortPlugin
          is to not sort the map output at all! On the Reduce side, the shuffled records
          will be passed to the Reducer without any merging.

          The NullSortPlugin is useful for solving the limit-N query problem that does not
          require order-by (for a complete description of the problem, please refer to
          HIVE-2004 and MAPREDUCE-1928). The idea is to stop an MR job that does simple
          filtering (no sorting needed) after a certain number of records are selected.

          The steps involved are as follows:

          • Set the parameters mapred.map.max.attempts and mapred.reduce.max.attempts to
            1, set mapred.reduce.slowstart.completed.maps to 0, and set the number of
            reducers to 1 for an MR job.
          • Implement a Mapper that does a condition based filtering.
          • Implement a Reducer which would output only the first N records, print out a
            diagnostic message in the log and throw an exception to stop the MR job.

          The first step makes sure that the job is not restarted and that the reducer is
          started right away. The third step limits the number of Map tasks started
          (which is the desired goal). The diagnostic message in the log will be useful
          to find out whether the job aborted or completed normally.
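          The limit-N reducer behavior from the steps above can be sketched in plain Java without any Hadoop types; takeFirstN, the limit N, and the exception message are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only sketch of the limit-N reducer idea: emit at most N
// records, then throw to abort the (single-attempt) job.
public class LimitNDemo {
    static final int N = 3;

    static List<String> takeFirstN(List<String> shuffled) {
        List<String> out = new ArrayList<>();
        for (String record : shuffled) {
            if (out.size() == N) {
                // In the real job this is where the reducer would log a
                // diagnostic message and throw to stop the MR job.
                throw new RuntimeException("limit of " + N + " reached");
            }
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = java.util.Arrays.asList("a", "b", "c", "d");
        try {
            takeFirstN(records);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // limit of 3 reached
        }
    }
}
```

          Because mapred.reduce.max.attempts is 1, the thrown exception fails the job immediately instead of triggering a reduce retry.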

          Deepak Jain added a comment -

          Asokan,

          Keep up the good work. Wish you happy new year.

          Regards,
          Deepak Jain

          Mariappan Asokan added a comment -

          The patch has been tested against the trunk revision.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12530864/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2437//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2437//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2437//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Fixed findbugs and javadoc warnings.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12531175/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapred.TestReduceFetchFromPartialMem

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2442//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2442//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          The failing test seems to be flaky. Searching for org.apache.hadoop.mapred.TestReduceFetchFromPartialMem turns up a lot of hits in the MapReduce JIRA. I will look at the test more closely to see whether it can be fixed; I welcome input from other developers on this. Meanwhile, I can retry the same patch file to see whether this failure goes away magically.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12531752/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.io.file.tfile.TestTFileByteArrays
          org.apache.hadoop.io.file.tfile.TestTFileJClassComparatorByteArrays
          org.apache.hadoop.fs.viewfs.TestViewFsTrash
          org.apache.hadoop.hdfs.server.namenode.TestProcessCorruptBlocks
          org.apache.hadoop.hdfs.TestDatanodeBlockScanner
          org.apache.hadoop.mapred.TestReduceFetchFromPartialMem

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2452//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2452//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Trying one more time...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12533362/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.io.file.tfile.TestTFileByteArrays
          org.apache.hadoop.io.file.tfile.TestTFileJClassComparatorByteArrays
          org.apache.hadoop.mapred.TestReduceFetch

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2507//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2507//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Mariappan - Sorry it took me so long, and thanks for being super patient! I'm clearing my review backlog and I'll get to this very soon! Meanwhile, can you please look at the test failures? Thanks again!

          Mariappan Asokan added a comment -

          Thanks for your comments, Arun. I will start looking at the failing tests.

          – Asokan

          Mariappan Asokan added a comment -

          Hi Arun,
          HADOOP-8537 addresses the failing TFile tests. I will start looking at the TestReduceFetch test.

          Mariappan Asokan added a comment -

          Hi Arun,
          I came across MAPREDUCE-1392, which mentions TestReduceFetch failures. I was able to reproduce intermittent failures by running the test repeatedly on my laptop. I may be wrong, but it appears to be related to a timing issue. I will investigate how the test can be stabilized so that it passes reliably on reruns. If I find anything or have a proposal to fix the issue, I will add my comments to MAPREDUCE-1392.

          Mariappan Asokan added a comment -

          I think I found the reason why the TestReduceFetch test was failing. The line:

              commitMemory -= size;

          in the unreserve() method of MergeManager.java was missing from my patch. I have corrected the problem and resubmitted the patch.
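
          The effect of the missing line can be illustrated with a minimal accounting sketch. This is a hypothetical stand-in, not the real MergeManager code: only commitMemory and unreserve() appear in the comment above, while reserve(), usedMemory, and the class name are invented for illustration. Without the decrement in unreserve(), commitMemory only ever grows, so any threshold based on it would eventually misfire.

```java
// Hypothetical sketch of the memory accounting described above; NOT the
// actual MergeManager source. reserve(), usedMemory, and the class name
// are assumptions made for this example.
public class MemoryAccounting {
    private long usedMemory = 0;   // bytes currently reserved
    private long commitMemory = 0; // bytes committed to in-memory outputs

    public synchronized void reserve(long size) {
        usedMemory += size;
        commitMemory += size;
    }

    public synchronized void unreserve(long size) {
        usedMemory -= size;
        commitMemory -= size;  // the line that was missing from the patch
    }

    public synchronized long getCommitMemory() {
        return commitMemory;
    }
}
```

          With the fix, a reserve followed by a matching unreserve leaves commitMemory back at its previous value instead of leaking.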

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537540/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2647//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2647//console

          This message is automatically generated.

          Sean Zhang added a comment -

          Asokan,
          Your patch is based on revision 1363960 which is in between the following two builds of the trunk.
          https://builds.apache.org/view/Hadoop/job/Hadoop-Mapreduce-trunk/1142/ (1363576)
          https://builds.apache.org/view/Hadoop/job/Hadoop-Mapreduce-trunk/1143/ (1364020)

          Both of the above builds failed with the following:
          Running org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat
          Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.385 sec <<< FAILURE!

          If I am not mistaken, most likely the failure is not caused by your change. You just need to pick a clean base. What do you think?

          Mariappan Asokan added a comment -

          Hi Sean Zhang,
          You may be right. I have been looking at this test and it does not seem to be related to my changes. I will try with a clean base. Thanks.

          Mariappan Asokan added a comment -

          Merged on top of trunk version 1361565.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12537708/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2652//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2652//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          MAPREDUCE-4470 addresses this test failure. I am waiting for a fix.

          Alejandro Abdelnur added a comment -

          Mariappan, the latest patch does not apply to trunk, would you post a new patch rebased to trunk? Thx

          Mariappan Asokan added a comment -

          Hi Alejandro,
          Thanks for the reminder. I picked up the changes in Fetcher.java and TestFetcher.java. The test TestCombineFileInputFormat will still fail.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12538585/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2687//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2687//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          MAPREDUCE-4470 has been committed. The test TestCombineFileInputFormat should no longer be failing.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12542197/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.TestHftpDelegationToken

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2773//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2773//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2773//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12542643/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.TestHftpDelegationToken

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2775//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2775//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2775//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12542841/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.TestHftpDelegationToken
          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2787//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2787//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2787//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Asokan - I started looking at this patch. Can you please provide a brief overview of your design, particularly the new interfaces you are adding and how they fit together? Is your old pdf still valid? Thanks.

          Mariappan Asokan added a comment -

          Hi Arun,
          Sure, I will expand the pdf documentation I posted long ago and upload an updated version shortly.

          I also created a test that implements what I call a NullSortPlugin. I will include this test as part of the patch. The implementation is very simple; it demonstrates how sorting can be avoided entirely to solve the limit-N query problem described in HIVE-2004 and MAPREDUCE-1928.
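
          To make the limit-N idea concrete, here is a hedged sketch of a pass-through sorter. The interface below is a simplified stand-in invented for illustration; the actual MapOutputSorter interface in the patch is different. The point it demonstrates: for a limit-N query, any N records satisfy the result, so the plugin can skip sorting and simply retain the first N records it sees.

```java
// Simplified illustration of a "null" sort plugin; the interface and all
// names here are hypothetical stand-ins, not the patch's real API.
import java.util.ArrayDeque;
import java.util.Deque;

interface SimpleSorterPlugin {                 // hypothetical interface
    void collect(String key, String value);    // accept a map output record
    String[] next();                           // next {key, value}, or null
}

class NullSortPlugin implements SimpleSorterPlugin {
    private final Deque<String[]> records = new ArrayDeque<>();
    private final int limit;                   // keep at most N records

    NullSortPlugin(int limit) {
        this.limit = limit;
    }

    @Override
    public void collect(String key, String value) {
        // No sorting at all: keep records in arrival order, capped at N.
        if (records.size() < limit) {
            records.addLast(new String[] { key, value });
        }
    }

    @Override
    public String[] next() {
        return records.pollFirst();            // null when exhausted
    }
}
```

          A sorting plugin would instead buffer and order records by key before emitting them; the null variant shows the extreme case where the plugin contract is satisfied with no ordering work.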

          Mariappan Asokan added a comment -

          Hi Arun,
          I have attached the latest patch that includes a test. I have also uploaded the latest pdf document on the plugin in order to make the code review somewhat easier. If you want more details in the document, please let me know.

          It appears that some tests unrelated to this Jira have been failing. I am not able to pick up a clean trunk to work with.

          Thanks for your patience.

          – Asokan

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543168/HadoopSortPlugin.pdf
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2794//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543174/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          -1 javac. The applied patch generated 2066 javac compiler warnings (more than the trunk's current 2059 warnings).

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.TestHftpDelegationToken

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2795//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2795//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2795//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543218/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          -1 javac. The applied patch generated 2060 javac compiler warnings (more than the trunk's current 2059 warnings).

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
          org.apache.hadoop.hdfs.TestHftpDelegationToken

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2798//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2798//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2798//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Fixed the new test to get rid of compiler warnings.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543287/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.TestHftpDelegationToken
          org.apache.hadoop.hdfs.TestDatanodeBlockScanner

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2799//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2799//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Got rid of Javadoc warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543452/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.server.namenode.TestListCorruptFileBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2810//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2810//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Rebasing to a previous version of trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543461/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
          org.apache.hadoop.hdfs.server.datanode.TestDirectoryScanner

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2811//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2811//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Made some minor changes to create single instances of DefaultMapSortPlugin and DefaultReduceSortPlugin.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543878/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2814//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2814//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Hi Arun,
          The failing test TestBlocksWithNotEnoughRacks seems to be unrelated to this patch. MiniDFSCluster is failing during shutdown. Can you please review the patch while I look at the cause of failure?

          Thanks.

          – Asokan

          Mariappan Asokan added a comment -

          The failing test passes when run on my machine. Submitted the patch with some minor updates.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12545413/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2856//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2856//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Hi Arun,
          Can you please review the patch? If you need any additional information about the patch itself, please let me know.

          Thanks.
          – Asokan

          Mariappan Asokan added a comment -

          Posted a patch for MAPREDUCE-4482 (backport of MAPREDUCE-2454 to Hadoop 1.2) and a revised design document there.

          Mariappan Asokan added a comment -

          I would like to receive feedback from Arun, Alejandro, Sean, Deepak, or any other developers and committers who are watching this Jira so that it can be committed.
          Thanks.

          – Asokan

          Sean Zhang added a comment -

          This is going to open up a new set of possibilities for Hadoop. I hope this can be committed soon.

          Alejandro Abdelnur added a comment -

          Initial feedback on the patch (I'll do a more detailed review):

          • Nice work
          • Patch needs a rebase; TestReduceTask.java has been moved to hadoop-mapreduce-client-jobclient/
          • Remove introduced and unused imports throughout the patch
          • Reformat lines with over 80 chars throughout the patch

          I'm not thrilled about how we are mixing mapred and mapreduce classes in the APIs of the pluggable sort. But given how the current MR implementation is done, I don't think it is possible to avoid that without a major cleanup/refactoring of much bigger scope.

          One thing that would be quite useful, and I'd say a prerequisite before committing it, is a performance comparison of terasort with and without the patch; we shouldn't be introducing a noticeable performance penalty.

          Mariappan Asokan added a comment -

          Hi Alejandro,
          Thanks for your feedback. I have rebased to the recent trunk version. Moved TestReduceTask.java to hadoop-mapreduce-client-jobclient. Removed unused imports and reformatted the lines to 80 chars.

          Preliminary runs of terasort benchmark on a smaller cluster did not show any performance degradation. I am trying to get access to a 10 node cluster to run the terasort benchmarks with and without my patch. Please let me know whether that is good enough for comparing. Once I finish running the benchmark, I will post the results.

          Thanks once again for reviewing the patch.

          – Asokan

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12549046/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2928//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2928//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Mariappan, I've started taking a look - there is a lot to digest here. I apologize this has taken so long, but I've personally started on this many, many times and then got distracted since there is so much here for me to review. I'm sure it's the same for several other committers - it's not an excuse, but unfortunately you are asking all reviewers here for a significant investment of their time...

          This is much more so because you are fiddling with some of the most core pieces of the MR framework (i.e. sorting and shuffling).

          Also, a meta point - it's easier to work in an open-source community by starting small and building up some credibility quickly by fixing bugs, tests, etc. This way people are more comfortable when you make bigger changes.

          I'm not alone sharing this, see a recent interview with Todd:

          What is your advice for someone who is interested in participating in any open source project for the first time?

          Walk before you run. One mistake I’ve seen new contributors make is that they try to start off with a huge chunk of work at the core of the system. Instead, learn your way around the source code by doing small improvements, bug fixes, etc. Then, when you want to propose a larger change, the rest of the community will feel more comfortable accepting it. One great way to build karma in the community is to look at recently failing unit tests, file bugs, and fix them up.

          In general, it would be better if you could break this into a series of smaller patches and do this work in a development branch. This will make it easier to review.

          I understand this is frustrating, I apologize - but this is unfortunately a lot of work and highly risky one to boot.


          Having said this - how do we proceed?

          Let's start a discussion on mr-dev@ on design review, a dev branch where we make a series of small changes and then proceed there-on. Thoughts?

          Again, apologies.

          Mariappan Asokan added a comment -

          Hi Arun,
          Thanks for your feedback. Though I have confidence in my contribution (I have been running Terasort without any problems for data sizes of 2 TB on a small cluster), I understand your concerns about the size of the patch. I can think of the following sub-steps, each addressed in a separate Jira:

          • Refactor Task.java so that the classes ValuesIterator, CombinerRunner, CombineValuesIterator, and CombineOutputCollector can be taken out into separate files.
          • Refactor MapOutput.java to create InMemoryMapOutput and OnDiskMapOutput classes.
          • Refactor Shuffle.java and MergeManager.java to decouple shuffle and merge. This should also allow one to make shuffle pluggable. There will be a small change to ReduceTask.java as part of this decoupling since ReduceTask will instantiate both Shuffle and MergeManager objects.
          • Refactor MapTask.java so that the code related to sort on the map side is moved to a new file MapSort.java. Introduce SortinRecordWriter and SortoutRecordReader classes as part of this refactoring.
          • Refactor ReduceTask.java so that merge related code is moved to a new file ReduceSort.java.
          • Define corresponding interfaces for MapSort and ReduceSort classes and make these implementations pluggable.

          How does the above sequence of changes sound to you? I can raise separate Jiras for each one. We can keep these changes in a separate branch before moving to the trunk if you wish.

          If you have other suggestions, please let me know.

          Thanks again.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Arun, Thanks for your feedback. Though I have confidence in my contribution(I have been running Terasort without any problems for data sizes of 2 TB on a small cluster), I understand your concerns on the size of the patch. I can think of the following sub-steps each addressed in a different Jira: Refactor Task.java so that the classes ValuesIterator , CombinerRunner , and CombineValuesIterator , CombineOutputCollector can be taken out to separate files. Refactor MapOutput.java to create InMemoryMapOutput and OnDiskMapOutput classes. Refactor Shuffle.java and MergeManager.java to decouple shuffle and merge. This should also allow one to make shuffle pluggable. There will be a small change to ReduceTask.java as part of this decoupling since ReduceTask will instantiate both Shuffle and MergeManager objects. Refactor MapTask.java so that the code related to sort on the map side is moved to a new file MapSort.java . Introduce SortinRecordWriter and SortoutRecordReader classes as part of this refactoring. Refactor ReduceTask.java so that merge related code is moved to a new file ReduceSort.java . Define corresponding interfaces for MapSort and ReduceSort classes and make these implementations pluggable. How does the above sequence of changes sound to you? I can raise separate Jiras for each one. We can keep these changes in a separate branch before moving to the trunk if you wish. If you have other suggestions, please let me know. Thanks again. – Asokan
          Hide
          Mariappan Asokan added a comment -

          Hi Arun,
          I stepped back and looked at some of the Jiras that were developed under branches(like MAPREDUCE-279, HDFS-1623, etc.) They contributed major enhancements to Hadoop. MAPREDUCE-2454 on the other hand has just refactored existing code to make some classes more modular and pluggable.

          There is no significant new functionality added. There is no change in the way shuffling is done. The communication between shuffle and merge is kept intact. The MAPREDUCE-318 refactoring in fact helped a lot. There is only a minor rearranging of the code in MAPREDUCE-2454.

          I followed your advice diligently to work on the trunk first and I picked up suggestions from some other developers who are watching this Jira. During our brief meetings, the suggestions you gave were very valuable and I followed them.

          I have been testing the changes for the last year or so and I have been keeping the Jira up to the latest trunk. I have not hit any issues in my testing w.r.t both functionally and performance.

          I understand that your time is very precious. I already updated the design document for you to make it easier for code review. If you can suggest anything that will expedite the committing, please let me know.

          Thanks.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Arun, I stepped back and looked at some of the Jiras that were developed under branches(like MAPREDUCE-279 , HDFS-1623 , etc.) They contributed major enhancements to Hadoop. MAPREDUCE-2454 on the other hand has just refactored existing code to make some classes more modular and pluggable. There is no significant new functionality added. There is no change in the way shuffling is done. The communication between shuffle and merge is kept intact. The MAPREDUCE-318 refactoring in fact helped a lot. There is only a minor rearranging of the code in MAPREDUCE-2454 . I followed your advice diligently to work on the trunk first and I picked up suggestions from some other developers who are watching this Jira. During our brief meetings, the suggestions you gave were very valuable and I followed them. I have been testing the changes for the last year or so and I have been keeping the Jira up to the latest trunk. I have not hit any issues in my testing w.r.t both functionally and performance. I understand that your time is very precious. I already updated the design document for you to make it easier for code review. If you can suggest anything that will expedite the committing, please let me know. Thanks. – Asokan
          Hide
          Arun C Murthy added a comment -

          Asokan, sorry I've been busy with stuff - thanks for understanding.

          I've spent sometime thinking about this - and I feel we can do something far simpler to address Syncsort's goal of plugging in your proprietary sort while mitigating risk to MR itself.

          How about this: I feel we could accomplish both goals by something very simple... by making MapOutputBuffer pluggable by introducing a MapOutputCollector interface. That's about it. This way, you can supply a custom MapOutputBuffer which plugs in your sort for your customers while we can just keep our current implementation.

          Hopefully, that makes sense. What else would you need?

          I'm basically trying to vastly minimize the APIs we spread out, this way when we want to change our sort implementation for MAPREDUCE-4039 or Sailfish etc. we have much more leeway, at the same time we don't affect you at all.

          Thoughts?

          Show
          Arun C Murthy added a comment - Asokan, sorry I've been busy with stuff - thanks for understanding. I've spent sometime thinking about this - and I feel we can do something far simpler to address Syncsort's goal of plugging in your proprietary sort while mitigating risk to MR itself. How about this: I feel we could accomplish both goals by something very simple... by making MapOutputBuffer pluggable by introducing a MapOutputCollector interface. That's about it. This way, you can supply a custom MapOutputBuffer which plugs in your sort for your customers while we can just keep our current implementation. Hopefully, that makes sense. What else would you need? I'm basically trying to vastly minimize the APIs we spread out, this way when we want to change our sort implementation for MAPREDUCE-4039 or Sailfish etc. we have much more leeway, at the same time we don't affect you at all. Thoughts?
          Hide
          Mariappan Asokan added a comment -

          Hi Arun,
          Thank you very much for allotting some time to have a conversation with you during Strata 2012. Here is the list of items we discussed and how I followed up in the new patch.

          • With YARN, different MR data processing engines can co-exist in addition to the sort/merge done after map and before reduce. Keeping this in mind, I am calling the sort plugin interface on the map side as PostMapProcessor.
            Similarly, the merge done on the reduce side will be abstracted as PreReduceProcessor.
          • The PostMapProcessor can simply extend the existing MapOutputCollector with an initialize() method. The current MapOutputBuffer
            in MapTask.java will implement this interface as the default implementation.
          • On the reduce side, my suggestion is to define PreReduceProcessor based on methods already available in MergeManager class. With minimal changes, this will allow MergeManager to implement
            Unknown macro: {PreReduceProcessor.}

            }

          • There is a concern about exposing some APIs as public. Since the revised patch is much smaller than the one submitted before(one fourth of the original patch size), the chance of breaking anything is minimized. Also, I feel that only a handful of developers will write plugins. I have marked all the exposed APIs with proper annotations that APIs are not stable and there is a risk using them. The plugin developers should keep up with the changes in the exposed APIs. The core Hadoop developers need not worry about maintaining backward compatibility.

          The revised patch can be easily integrated with shuffle plugin.

          I repeatedly ran terasort benchmark on a cluster with 55 nodes. The performance difference with and without the patch was egligible(plus or minus 1%.)

          I would like to receive feedback from you and other developers who are watching this Jira. In the meantime, I am creating a new test to test the plugin.

          Thanks.
          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Arun, Thank you very much for allotting some time to have a conversation with you during Strata 2012. Here is the list of items we discussed and how I followed up in the new patch. With YARN, different MR data processing engines can co-exist in addition to the sort/merge done after map and before reduce. Keeping this in mind, I am calling the sort plugin interface on the map side as PostMapProcessor. Similarly, the merge done on the reduce side will be abstracted as PreReduceProcessor. The PostMapProcessor can simply extend the existing MapOutputCollector with an initialize() method. The current MapOutputBuffer in MapTask.java will implement this interface as the default implementation. On the reduce side, my suggestion is to define PreReduceProcessor based on methods already available in MergeManager class. With minimal changes, this will allow MergeManager to implement Unknown macro: {PreReduceProcessor.} } There is a concern about exposing some APIs as public. Since the revised patch is much smaller than the one submitted before(one fourth of the original patch size), the chance of breaking anything is minimized. Also, I feel that only a handful of developers will write plugins. I have marked all the exposed APIs with proper annotations that APIs are not stable and there is a risk using them. The plugin developers should keep up with the changes in the exposed APIs. The core Hadoop developers need not worry about maintaining backward compatibility. The revised patch can be easily integrated with shuffle plugin. I repeatedly ran terasort benchmark on a cluster with 55 nodes. The performance difference with and without the patch was egligible(plus or minus 1%.) I would like to receive feedback from you and other developers who are watching this Jira. In the meantime, I am creating a new test to test the plugin. Thanks. – Asokan
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552499/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 javac. The applied patch generated 2026 javac compiler warnings (more than the trunk's current 2025 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552499/mapreduce-2454.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. -1 javac . The applied patch generated 2026 javac compiler warnings (more than the trunk's current 2025 warnings). +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2992//console This message is automatically generated.
          Hide
          Mariappan Asokan added a comment -

          Fixed the compiler warning.

          Show
          Mariappan Asokan added a comment - Fixed the compiler warning.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552530/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2994//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2994//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552530/mapreduce-2454.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2994//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2994//console This message is automatically generated.
          Hide
          Alejandro Abdelnur added a comment -

          Asokan,

          I like the new approach, it is much simpler.

          Patch looks good, some minor comments:

          • Do we need 2 different interfaces MapOutputCollector and PostMapProcessor? Couldn't we collapse both on PostMapProcessor?
          • If keeping 2 different interfaces, make MapOutputCollector an outer interface
          • MapTask class should be annotated as public/unstable
          • MRJobConfig.java line 496 has a false space diff
          • PreReduceProcessor should be annotated as public/unstable
          Show
          Alejandro Abdelnur added a comment - Asokan, I like the new approach, it is much simpler. Patch looks good, some minor comments: Do we need 2 different interfaces MapOutputCollector and PostMapProcessor? Couldn't we collapse both on PostMapProcessor? If keeping 2 different interfaces, make MapOutputCollector an outer interface MapTask class should be annotated as public/unstable MRJobConfig.java line 496 has a false space diff PreReduceProcessor should be annotated as public/unstable
          Hide
          Mariappan Asokan added a comment -

          Hi Alejandro,
          Thanks for the feedback. I changed MapOutputCollector to PostMapProcessor so that there is only one interface. I also made the other changes you suggested. I am uploading the new patch. Please take a look at it.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Alejandro, Thanks for the feedback. I changed MapOutputCollector to PostMapProcessor so that there is only one interface. I also made the other changes you suggested. I am uploading the new patch. Please take a look at it. – Asokan
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552689/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2999//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2999//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552689/mapreduce-2454.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2999//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2999//console This message is automatically generated.
          Hide
          Alejandro Abdelnur added a comment -

          Asokan, if I understood you correctly you were working in a new testcase. This is not in the latest patch, correct? When you upload the patch with the new testcase please fix the following nit:

          Fetcher.java has 2 unused imports

          import java.io.InputStream;
          import java.io.OutputStream;

          Then, IMO we are good to go.

          Show
          Alejandro Abdelnur added a comment - Asokan, if I understood you correctly you were working in a new testcase. This is not in the latest patch, correct? When you upload the patch with the new testcase please fix the following nit: Fetcher.java has 2 unused imports import java.io.InputStream; import java.io.OutputStream; Then, IMO we are good to go.
          Hide
          Mariappan Asokan added a comment -

          Hi Alejandro,
          Thanks for catching the unused imports. I updated Fetcher.java. I have also added a test in the latest patch.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Alejandro, Thanks for catching the unused imports. I updated Fetcher.java. I have also added a test in the latest patch. – Asokan
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552875/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.TestJobMonitorAndPrint
          org.apache.hadoop.mapred.TestClusterMRNotification

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3005//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3005//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552875/mapreduce-2454.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient: org.apache.hadoop.mapreduce.TestJobMonitorAndPrint org.apache.hadoop.mapred.TestClusterMRNotification +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3005//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3005//console This message is automatically generated.
          Hide
          Mariappan Asokan added a comment -

          Hi Alejandro,
          I ran the tests on my box. The failing tests are failing without my patch. The failure does not seem to be related to my patch.

          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Alejandro, I ran the tests on my box. The failing tests are failing without my patch. The failure does not seem to be related to my patch. – Asokan
          Hide
          Mariappan Asokan added a comment -

          Hi Alejandro,
          I made the following changes:

          • Removed some unused parameters from Shuffle constructor.
          • Made some methods public in CombinerRunner which is defined in Task.java so that CombinerRunner can be reused from plug-in implementations.

          Please provide your feedback.

          Thanks.
          – Asokan

          Show
          Mariappan Asokan added a comment - Hi Alejandro, I made the following changes: Removed some unused parameters from Shuffle constructor. Made some methods public in CombinerRunner which is defined in Task.java so that CombinerRunner can be reused from plug-in implementations. Please provide your feedback. Thanks. – Asokan
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552985/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.TestJobMonitorAndPrint

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3008//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3008//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552985/mapreduce-2454.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient: org.apache.hadoop.mapreduce.TestJobMonitorAndPrint +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3008//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3008//console This message is automatically generated.
          Hide
          Mariappan Asokan added a comment -

          Hi Alejandro,
          I made minor changes in annotations. Also, I am summarizing the visibility changes in the following table(Hope it is formatted correctly. If not I will attach a separate file to this Jira):

          Class Protection Annotation Reason
          org.apache.hadoop.mapred.MapTask Changed from package protected to public Annotated as Public, Unstable To be able to access inner interface PostMapProcessor and inner class MapOutputBuffer from external plug-ins.
          org.apache.hadoop.mapred.MapTask.MapOutputCollector Name changed to PostMapProcessor and protection changed from package protected to public Annotated as Public, Unstable To be able to implement external plug-ins.
          org.apache.hadoop.mapred.MapTask.MapOutputBuffer Changed to a static inner class and protection changed from package protected to public Annotated as Public, Unstable To be able to use the default implementation of PostMapProcessor.
          org.apache.hadoop.mapred.SpillRecord Changed from package protected to public Annotated as Public, Unstable To be able to write the intermediate output from PostMapProcessor plug-ins.
          org.apache.hadoop.mapred.Task.CombinerRunner Changed from protected to public Changed from Private, Unstable to Public, Unstable For code reuse to run the Combiner from external PostMapProcessor plug-ins.
          org.apache.hadoop.mapred.Task.TaskReporter Changed from protected to public Changed from Private, Unstable to Public, Unstable Has to be passed to CombinerRunner from external PostMapProcessor plug-ins.
          org.apache.hadoop.mapreduce.task.reduce.ExceptionReporter Changed from package protected to public Annotated as Public, Unstable To report exception to Shuffle from PreReduceProcessor.
          org.apache.hadoop.mapreduce.task.reduce.MapHost Changed from package protected to public Annotated as Public, Unstable Passed to shuffle() method in MapOutput.
          org.apache.hadoop.mapreduce.task.reduce.MapOutput Changed from package protected to public Annotated as Public, Unstable Returned from reserve() method in PreReduceProcessor.
          org.apache.hadoop.mapreduce.task.reduce.MergeManager No change Changed from Private, Unstable to Public, Unstable To reuse the default implementation of PreReduceProcessor.
          org.apache.hadoop.mapreduce.task.reduce.Shuffle No change Changed from Private, Unstable to Public, Unstable Passed to external PreReduceProcessor plug-ins.
          org.apache.hadoop.mapreduce.task.reduce.ShuffleClientMetrics Changed from package protected to public Annotated as Public, Unstable Passed to shuffle() method in external MapOutput.
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553151/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3017//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3017//console

          This message is automatically generated.

          Alejandro Abdelnur added a comment - - edited

          Asokan,

          Thanks for your patience in working out the design and implementation of this JIRA.

          The latest patch looks good and it addresses the concerns voiced regarding big changes that could destabilize the MR framework.

          There are a few minor nits that should be addressed in the patch (listed at the end of this comment).

          +1 after these NITs are addressed and jenkins test-patch OKs the new patch.

          Patch NITs:

          • PreReduceProcessor.java: unused import: Shuffle
          • ReduceTasks.java: line 357, PreReduceProcessor merger = (PreReduceProcessor) ReflectionUtils.newInstance(..., no need for the "(PreReduceProcessor)" casting.
          • Shuffle.java, unused imports: FileSystem, LocalDirAllocator, CompressionCodec, MapOutputFile, RawKeyValueIterator, Reducer, CombineOutputCollector
          • TestLimitNQuery.java, unused imports: FileInputStream, FileNotFoundException, FileStatus, JobCounter, TaskAttemptID, ReflectionUtils
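As an illustration of the casting nit above — a minimal, self-contained sketch (hypothetical names, not the actual Hadoop classes) of why the explicit cast is redundant: a generic factory shaped like ReflectionUtils.newInstance(Class&lt;T&gt;, conf) already returns T.

```java
// Self-contained sketch (not Hadoop code) of the redundant-cast nit:
// a generic factory like ReflectionUtils.newInstance(Class<T>, conf)
// already returns T, so no (PreReduceProcessor)-style cast is needed.
interface Merger { String name(); }

class DefaultMerger implements Merger {
  public String name() { return "default"; }
}

public class NewInstanceSketch {
  // Stand-in for org.apache.hadoop.util.ReflectionUtils.newInstance.
  static <T> T newInstance(Class<T> clazz) {
    try {
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // No cast required: the compiler infers T = DefaultMerger, a Merger.
    Merger merger = newInstance(DefaultMerger.class);
    System.out.println(merger.name());
  }
}
```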
          Mariappan Asokan added a comment -

          Hi Alejandro,
          Thanks for pointing out the nits. I fixed all of them. I am uploading a new patch.

          – Asokan

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553200/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3019//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3019//console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          +1

          Alejandro Abdelnur added a comment -

          I'm planning to commit this patch around mid day PST today.

          Arun C Murthy added a comment -

          Asokan - thanks again for being patient and working with me through this, and also for spending time with me in person on the design.
          I think this is very close! Thanks again!

          Also, Alejandro, thanks for taking a look.

          Some comments:

          1. All the apis should be marked 'LimitedPrivate' (not Public), 'Unstable' to make it clear that this is only for implementers.
          2. I'd rather keep the names as MapOutputBuffer or MapOutputSortingOutput rather than PostMapProcessor to make it clear that this is the sort buffer. Similarly, we should just call it ReduceInputMerger or some such?
          3. We shouldn't need to make TaskReporter public since we already have a public Reporter api, correct?
          4. Perhaps one of my biggest concerns is about MapOutput, I don't understand why it has a 'shuffle' method and shuffles the output itself - it is merely meant to be an abstraction of the output - can you pls help me understand this?
          5. Since you've already made SpillRecord (java) public, we don't have to move out IndexRecord outside?
          6. In general, it would be useful if you could avoid making formatting changes in a large patch, to keep it smaller - I see it's gone from 60K or so to 100+K again!

          Thanks again.
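Point 1 above can be sketched as follows. This is a self-contained illustration: the real annotations are InterfaceAudience.LimitedPrivate and InterfaceStability.Unstable in org.apache.hadoop.classification; local stand-ins are declared here so the snippet compiles on its own, and SortPlugin is a hypothetical interface name.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Local stand-ins mirroring Hadoop's audience/stability annotations
// (the real ones live in org.apache.hadoop.classification).
@Documented @Retention(RetentionPolicy.RUNTIME)
@interface LimitedPrivate { String[] value(); }

@Documented @Retention(RetentionPolicy.RUNTIME)
@interface Unstable { }

// A plugin-facing interface marked for implementers only, with no
// compatibility guarantee across releases.
@LimitedPrivate({"MapReduce"})
@Unstable
interface SortPlugin {
  void sort();
}

public class AnnotationSketch {
  public static void main(String[] args) {
    LimitedPrivate lp = SortPlugin.class.getAnnotation(LimitedPrivate.class);
    System.out.println(lp.value()[0]);
  }
}
```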

          Mariappan Asokan added a comment -

          Hi Arun,
          Thanks for your comments. I have addressed each one of them below in the same order:

          • I have modified all the annotations to LimitedPrivate instead of Public.
          • Kept the name and moved MapOutputCollector to a separate file. Also, renamed PreReduceProcessor to ReduceInputMerger.
          • The class CombinerRunner expects a Task.TaskReporter object to be passed to the create() method. CombinerRunner will be used by plugin implementations to run the Combiner. Also, MapOutputBuffer expects TaskReporter not just Reporter.
          • Currently, MapOutput has commit() and abort() methods for the shuffled data. It is natural to have shuffle() there too. Besides, shuffle() becomes polymorphic so that OnDiskMapOutput, InMemoryMapOutput, or plugin implementations can implement it differently. Currently, there is an if-then check in Fetcher.java (to decide whether to shuffle to disk or memory) which I thought was not very clean.
          • We need to make IndexRecord public. Once we do that, the Java compiler does not allow two public classes at the same level in a single source file. The other possibility is to make IndexRecord a static inner class of SpillRecord. I tried that, but I got compilation errors from four other source and test files, since I would have to add an import statement to each, widening the scope of the changes.
          • The test I am contributing with the patch is making it look bigger. The test includes the implementation of a plugin that avoids sorting. I am planning to contribute a more robust implementation of such a plugin separately for MAPREDUCE-4039 if no one else volunteers. As part of that, I can modify this test to make use of that plugin so that it will become much smaller.
            To give you an idea of the breakdown of the patch, in addition to the overall patch file I am attaching multiple patch files, each addressing one of the following items:
            • patch with only access protection and annotation changes (mapreduce-2454-protection-change.patch)
            • patch with only refactored and modified code (mapreduce-2454-modified-code.patch)
            • patch with only the modified test (mapreduce-2454-modified-test.patch)
            • patch with only the new test added (mapreduce-2454-new-test.patch)

          Once again thank you very much for allotting some time to look at the patch.

          – Asokan
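The polymorphic shuffle() argument above can be sketched as follows — a self-contained illustration with hypothetical names (not the patch's actual classes): each MapOutput subclass decides where fetched bytes go, so the fetcher needs no disk-vs-memory if-then.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the design point: shuffle() is polymorphic on the
// output type, so the fetcher just delegates instead of branching.
abstract class MapOutputSketch {
  abstract void shuffle(byte[] data);   // each target handles its own bytes
  abstract long size();
}

class InMemoryOutput extends MapOutputSketch {
  private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
  void shuffle(byte[] data) { buf.write(data, 0, data.length); }
  long size() { return buf.size(); }
}

class OnDiskOutput extends MapOutputSketch {
  private long bytesWritten;            // a real impl would spill to a file
  void shuffle(byte[] data) { bytesWritten += data.length; }
  long size() { return bytesWritten; }
}

public class FetcherSketch {
  // No disk-vs-memory if-then here; reserve() elsewhere picked the subclass.
  static void fetch(MapOutputSketch out, byte[] segment) {
    out.shuffle(segment);
  }

  public static void main(String[] args) {
    MapOutputSketch mem = new InMemoryOutput();
    MapOutputSketch disk = new OnDiskOutput();
    fetch(mem, new byte[]{1, 2, 3});
    fetch(disk, new byte[]{4, 5});
    System.out.println(mem.size() + "," + disk.size());
  }
}
```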

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553731/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3036//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3036//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          The failing test does not seem to be related to my patch.

          I made one last annotation change (Task.TaskReporter was still marked as Public) and I am uploading a new patch file.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12553789/mapreduce-2454.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:

          org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3038//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3038//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          This test failure is not related to my patch.

          – Asokan

          Alejandro Abdelnur added a comment -

          +1. Asokan, thanks for following up and addressing Arun's comments. I think the patch is ready to be committed. I'll be committing later today.

          Arun C Murthy added a comment -

          Asokan, does the test fail without your patch?

          Tucu - pls hold on, let me do a final review, since I've been working on this codeline for a long time now. Thanks.

          Arun C Murthy added a comment -

          Ok, I'm glad I caught this...

          One issue is that you don't want to pass a shuffle down to the merger, it's the other way around - you want to pass a merger to the shuffle.

          Also, I'm not wild about making the change to MapOutput to shuffle itself, like I said - it was initially designed as merely a 'struct'.


          As I've said before on this JIRA, I'd really appreciate it if you could break this apart into smaller chunks - it makes a reviewer's job much easier... e.g., I missed the shuffle/merger change since it's a largish patch.

          Can you pls create some natural sub-tasks:

          1. Move MapOutputCollector out as an interface into a new class outside of MapTask
          2. Introduce a new ReduceInputMerger or some such interface which is sufficient for your purposes
          3. Then mark all interfaces you need as 'LimitedPrivate'

          I appreciate your patience, thanks again.

          Mariappan Asokan added a comment -

          Hi Arun,
          Yes, the test fails without my patch. I confirmed it on my box.

          Thanks.

          – Asokan

          Arun C Murthy added a comment -

          Again, I apologize for being very careful on this one... as you can imagine, this is a fairly core piece of MR and hence my careful consideration. Also, you could really help me out by breaking this apart into smaller chunks I can review/commit fast in an iterative manner. Thanks.

          Mariappan Asokan added a comment -

          Hi Arun,
          Thanks for your comments. Regarding your comment on passing shuffle to merge: I did this with the following rationale:

          • Conceptually, merge can take its input from different types of sources; today we have two: one is Shuffle and the other is local files for a local job. Tomorrow, we may add a hybrid of Shuffle and local map output files (this will involve adding another method in the ReduceInputMerger) to avoid shuffling local map outputs and optimize performance. This new approach, which decouples Shuffle and Merge, is more flexible.
          • In the current implementation, Shuffle, which is supposed to transfer bytes from map outputs to the merger, also returns a RawKeyValueIterator, which implies it is doing more than transferring bytes.

          – Asokan

          Arun C Murthy added a comment -

          It's conceivable that we may take several directions, including push shuffle etc., which will require a sorter and not a merger.

          IAC, it's too early to call - hence I'd prefer we keep current semantics since they don't make a big difference either way, correct?

          Thanks.

          Mariappan Asokan added a comment -

          Hi Arun,
          I would like to make the following points:

          • We talked about different kinds of processing that can happen before the Reducer. Currently, we have a merge. It can be a sort as you mentioned, or a simple copy as well. The copy case arises when one wants to avoid the sorting that happens in the MR data flow; it would enable hash-based aggregation or joins in the Reducer.
          • Regardless of the processing done, or whether the shuffle is push or pull based, the processing component, not the shuffle, should drive the data flow. This is not obvious for a sort or merge. For a copy, it makes a big difference.
          • For a copy, we want the Reducer to receive the <key, value> pairs as soon as the data is shuffled (unlike sort or merge, which has to wait until the last <key, value> pair is seen before outputting the first one). There is no need to spill data to disk on the reduce side.
          • With the current arrangement, where shuffle assumes that the processing (merge) can return a RawKeyValueIterator only at the end of shuffling, it is impossible to support copy. There is an inherent deadlock, because copy wants to return the <key, value> pairs right away, whereas shuffle thinks that this can happen only at the end.
          • The change I made is very simple. It does not alter any semantics, and it allows the processing to be a copy without any deadlock. In fact, the test I created as part of this Jira does a simple copy before the Reducer.

          I hope I clarified the reason for the change.

          Thanks.

          – Asokan

          Arun C Murthy added a comment -

          With the current arrangement where shuffle assumes that the processing(merge) can return a RawKeyValueIterator only at the end of shuffling

          That is not true at all.

          It's trivial to return an iterator from a copy-only shuffle which is backed by a blocking shuffle which waits till any (not all) key/value pairs have been shuffled over the network.

          So, I don't think the change is necessary.
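To make Arun's suggestion concrete, here is a minimal, purely illustrative sketch (not actual Hadoop code; all class and method names are hypothetical) of a blocking, copy-only iterator: shuffle fetcher threads enqueue pairs as they arrive, and next() blocks until at least one pair is available, so the reducer can start consuming before the shuffle has finished.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical sketch of a copy-only "merge": a blocking-queue-backed
 * iterator that hands <key, value> pairs to the reducer as soon as the
 * shuffle delivers them, instead of waiting for the shuffle to complete.
 */
public class BlockingCopyIterator {
    // Sentinel entry marking the end of the shuffle.
    private static final Map.Entry<String, String> EOF = new SimpleEntry<>(null, null);

    private final BlockingQueue<Map.Entry<String, String>> queue;
    private Map.Entry<String, String> current;

    public BlockingCopyIterator(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by shuffle fetcher threads as map output arrives. */
    public void emit(String key, String value) {
        try {
            queue.put(new SimpleEntry<>(key, value));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    /** Called once by the shuffle after all map outputs have been copied. */
    public void close() {
        try {
            queue.put(EOF);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    /** Blocks until a pair is available; returns false at end of shuffle. */
    public boolean next() {
        try {
            current = queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return current != EOF;
    }

    public String getKey()   { return current.getKey(); }
    public String getValue() { return current.getValue(); }
}
```

Under this scheme the reducer sees pairs incrementally, which is the behavior the copy case wants; whether that avoids touching the Shuffle code itself is exactly the point under debate here.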

          Arun C Murthy added a comment -

          IAC, as I've asked many times before, can you please upload the smaller patches to sub-tasks so we can keep discussions focussed on a single change rather than debate cross-cutting changes? This way I can review/commit simpler changes faster and we can progress rapidly. Thanks.

          Alejandro Abdelnur added a comment -

          Arun,

          Thanks for taking the time to help me review this JIRA, I really appreciate that. I also agree this JIRA is in the core of MR and merits the extra attention it got.

          The current approach, which has been done following your recommendations, has already been thoroughly reviewed by you and myself. Plus, the current size of the patch is approx 100K, with 33K of it in testcases, less than 15K of new code, and the rest a straightforward refactoring of existing code.

          All the feedback you've provided has been integrated in the patch.

          There is only one discussion left, that you brought up, regarding 'passing a shuffle down to the merger'.

          Because of this, I don't think it is necessary to break up this JIRA in sub JIRAs at this point.

          Once we decide on this pending issue I want to commit this patch. Any further improvement/change/fix can be done incrementally with follow up JIRAs in the same way we work in the rest of Hadoop.

          Cheers

          Alejandro Abdelnur added a comment -

          Following up on Arun's concern about 'passing a shuffle down to the merger', after spending some extra time looking at the code with and without the patch.

          I agree with Asokan's arguments on why the shuffle ought to be passed to the merger as in the latest patch.

          It is a clear separation of concerns: the shuffle only shuffles data, without having to be aware of how that data is handled afterwards.

          The change does not alter the end behavior of the shuffle-merge phase, thus it does not break any existing MR application. Nor can it break any existing Hadoop plugin (as all this was hardcoded and could not be replaced).

          Also, the change does not preclude implementing things like a push shuffle in the future.

          Regarding Arun's suggestion:

          It's trivial to return an iterator from a copy-only shuffle which is backed by a blocking shuffle which waits till any (not all) key/value pairs have been shuffled over the network.

          This would require changes in the shuffle, which could significantly increase the scope of work of this JIRA. On the other hand, the latest patch does not modify the Shuffle.

          My take here is along the lines of Arun's comment:

          I've spent sometime thinking about this - and I feel we can do something far simpler to address Syncsort's goal of plugging in your proprietary sort while mitigating risk to MR itself....How about this: I feel we could accomplish both goals by something very simple..

          Echoing Arun, we are mitigating risk while enabling the desired functionality.

          Arun C Murthy added a comment -

          I agree with Asokan's arguments on why the shuffle ought to be passed to the merger as in the latest patch.

          -1

          As I've noted before we are making changes without actual necessity, while hurting future functionality such as reduce-side sort etc. by elevating 'merge' to be a primary operation. I don't see the need for this, I haven't seen an actual argument why it's required.

          IAC, I'm frankly getting tired of arguing details on a 100k patch - please break this up so that I can review/commit them and we can actually have focussed discussions. Thanks.

          Arun C Murthy added a comment -

          Ok, I've created the subtasks. Asokan - please attach the patches you already have over there.

          Alejandro Abdelnur added a comment -

          Asokan, in order to get things moving, would you mind uploading the broken-down patches to the sub JIRAs? I don't see any contention with the visibility and MAP side sub JIRAs, so those could go in right away. Thx.

          Alejandro Abdelnur added a comment -

          Arun, just for the record, as I've told you over the phone, I disagree with the reasons for your -1. But I'm moving to subtasks to get this going.

          Mariappan Asokan added a comment -

          Hi Arun & Alejandro,
          I will follow your suggestions and create the sub-patches. I will post them as soon as I have them ready and tested.

          Thanks.
          – Asokan

          Mariappan Asokan added a comment -

          Hi Arun,
          I just have one question for you. You stated the following:

          It's trivial to return an iterator from a copy-only shuffle which is backed by a blocking shuffle which waits till any (not all) key/value pairs have been shuffled over the network.

          So, I don't think the change is necessary.

          I am curious to know what you meant by this. Can you please elaborate?

          Thanks.
          – Asokan

          Arun C Murthy added a comment -

          Alejandro - you, or Asokan, haven't given a single reason why the change is required... I've asked for this multiple times. All I've heard is you disagree with my opinion, particularly after I've given you ideas to accomplish some fairly abstract goals in the future within the context of existing functionality. With due respect, it seems like we are falling into the trap of ignoring opinions from people who have spent a lot of time & effort maintaining the codebase, without motivating changes sufficiently, especially when people have little context in the codebase they are suggesting major modifications in.

          IAC, let's continue this discussion on MAPREDUCE-4808.

          Laxman added a comment -

          Asokan & Alejandro, IIUC the objective/goal of this feature is to make the Shuffler & Merger pluggable, so that MapReduce users can plug in better algorithms.

          We are in the process of implementing the "Network Levitated Merge" algorithm [ http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf ]. With this, we want to avoid disk usage on the reducer side completely and directly shuffle the map outputs to the reducer.

          IMO, with MAPREDUCE-2454 and other sub-tasks as well, we may not be able to plugin the above algorithm.

          Request you to provide your suggestions on this.

          Mariappan Asokan added a comment -

          Hi Laxman,
          The way the Shuffle plugin is defined, it can complement the Merger plugin or it can act alone in order to generate the final merged output.

          In the former case, the shuffle plugin can implement accelerated shuffling in software or in hardware+software combination. You can still use a Merger plugin to do the merge.

          In the latter case (this is what is referred to as "Network Levitated Merge" in the paper you mention), the Shuffle plugin itself can do the shuffle+merge. In this case, the Shuffle plugin can choose to ignore the Merger plugin.

          In essence, with individual units pluggable you get more choices, not restrictions.

          – Asokan
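
          For readers wondering how such plugins are selected at job-submission time: the pluggable shuffle and map-side sort are wired in via job configuration properties. As a sketch (the property names below follow the MAPREDUCE-4049/MAPREDUCE-4807 work and should be verified against your Hadoop release; the plugin class names are hypothetical placeholders):

```xml
<!-- Hypothetical example of selecting pluggable implementations.
     Verify property names against your Hadoop release. -->
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>com.example.MyShufflePlugin</value>
</property>
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>com.example.MyMapOutputCollector</value>
</property>
```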

          Laxman added a comment -

          Asokan, thank you very much for your quick response and detailed clarification.
          I will give a try with the patches available here. I will get back to you soon on this.

          Alejandro Abdelnur added a comment -

          I've asked Asokan for a preliminary patch with all the subtasks for MAPREDUCE-2454 and ran gridmix both on trunk and on the patch on a 35-machine cluster. The trace had ~1000 jobs. I did 2 runs with trunk and 2 runs with the patch.

          TRUNK:

          Time spent in simulation: 43mins, 31sec
          Time spent in simulation: 41mins, 28sec

          MAPREDUCE-2454:

          Time spent in simulation: 42mins, 48sec
          Time spent in simulation: 40mins, 8sec

          Arun C Murthy added a comment -

          Thanks for the due diligence Tucu.

          With MAPREDUCE-4807 and MAPREDUCE-4809 we have accomplished the original goals of this jira i.e. make the sorter pluggable.

          With that, let's close this by merging to trunk and unlink the other enhancements from this task; they are independent of the original goals of this and can be tackled as such.

          Asokan - thanks for being patient and working with us. I'm glad that in the end we are able to support your original goal of allowing Syncsort to plug in its custom sort while, simultaneously, keeping our long-term maintenance costs low. Thanks again.

          Alejandro Abdelnur added a comment -

          Great. I've just committed MAPREDUCE-4809 & MAPREDUCE-4807 to trunk. Thanks.

          Arun C Murthy added a comment -

          Thanks Tucu. Closing this.

          Mariappan Asokan added a comment -

          Hi Alejandro & Arun,
          Thanks to both of you for committing this.

          – Asokan

          Jerry Chen added a comment -

          Hi Asokan,
          Thank you for your work on this. I am following the whole thread, and one of the things I am interested in is pluggable sort on the reduce side. While this was discussed and was in the initial patches, it seems not to be in the final patch, as I didn't see any of the interfaces mentioned above added or modified on the reduce side.

          My understanding is that the current sort in MapReduce involves both the Map and Reduce sides. Does replacing only the map side make sense?

          – Haifeng
          Mariappan Asokan added a comment -

          Hi Jerry,
          Thanks for your interest. I will be continuing the work in MAPREDUCE-4808 to make the reduce side sort pluggable. I already posted a patch there. Please take a look at the interfaces defined in the patch. If you have any comments, please post them in MAPREDUCE-4808.

          – Asokan

          Jerry Chen added a comment -

          Hi Asokan,
          Thanks for your clarification; I realize now that MAPREDUCE-4808 is there.


            People

            • Assignee: Mariappan Asokan
            • Reporter: Mariappan Asokan
            • Votes: 0
            • Watchers: 58