Hadoop Common
  HADOOP-3702

add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Environment: all
    • Hadoop Flags: Reviewed
    • Release Note:
      Introduced the ChainMapper and the ChainReducer classes to allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ REDUCE MAP*.

      Description

      On the same input, we often need to run multiple Maps one after the other with no Reduce. We also need to run multiple Maps after the Reduce.

      If all pre-Reduce Maps are chained together and run as a single Map, a significant amount of disk I/O is avoided.

      Similarly, all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

      Attachments

      1. Hadoop-3702.patch
        60 kB
        Enis Soztutar
      2. patch3702.txt
        60 kB
        Alejandro Abdelnur
      3. patch3702.txt
        59 kB
        Alejandro Abdelnur
      4. patch3702.txt
        59 kB
        Alejandro Abdelnur
      5. patch3702.txt
        58 kB
        Alejandro Abdelnur
      6. patch3702.txt
        57 kB
        Alejandro Abdelnur
      7. patch3702.txt
        56 kB
        Alejandro Abdelnur
      8. patch3702.txt
        54 kB
        Alejandro Abdelnur
      9. patch3702.txt
        53 kB
        Alejandro Abdelnur
      10. patch3702.txt
        47 kB
        Alejandro Abdelnur
      11. patch3702.txt
        45 kB
        Alejandro Abdelnur
      12. patch3702.txt
        33 kB
        Alejandro Abdelnur
      13. patch3702.txt
        30 kB
        Alejandro Abdelnur
      14. patch3702.txt
        30 kB
        Alejandro Abdelnur

          Activity

          Alejandro Abdelnur created issue -
          Alejandro Abdelnur added a comment - edited

          This could be done with ChainMapper and ChainReducer classes that would manage the chain of Maps and would override the OutputCollector to implement the chaining.

          The Maps and Reduce that are part of the chain are unaware that they are executed in a chain; they receive records via the map and reduce methods and emit output via the OutputCollector.

          The API would look something like:

          
          public class ChainMapper implements Mapper {
          
            public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
            ...
          }
          
          public class ChainReducer implements Reducer {
          
            public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);
          
            public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
            ...
          }
          
          

          The Properties configuration passed to the Mapper and Reducer when setting them into the chain is injected into a copy of the job's configuration. This allows maps to be configured as usual without being aware that they are in a chain.

          Example of creating and submitting a chain job:

          
          JobConf conf = new JobConf();
          
          // chaining maps in the Map phase
          
          Properties mapAConf = new Properties();
          mapAConf.setProperty("a", "A");
          ChainMapper.addMapper(conf, AMap.class, mapAConf);
          
          ChainMapper.addMapper(conf, BMap.class, null);
          
          // setting the reducer
          
          Properties reduceConf = new Properties();
          ChainReducer.setReducer(conf, XReduce.class, reduceConf);
          
          // chaining maps in the Reduce phase
          
          ChainReducer.addMapper(conf, CMap.class, null);
          
          ChainReducer.addMapper(conf, DMap.class, null);
          
          ...
          
          FileInputFormat.setInputPaths(conf, inDir);
          FileOutputFormat.setOutputPath(conf, outDir);
          
          JobClient jc = new JobClient(conf);
          RunningJob job = jc.submitJob(conf);
          
          
          Alejandro Abdelnur made changes -
          Field Original Value New Value
          Description [edited: the API proposal text was moved out of the description into the first comment above, leaving only the motivation paragraphs]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385705 ]
          Alejandro Abdelnur added a comment -

          Proposed implementation. Unlike the initial proposal in the comment above, it uses a JobConf to configure the Maps in the chain.

          Alejandro Abdelnur made changes -
          Release Note The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*.
          An immediate benefit of this pattern is a reduction in disk I/O as many Maps can be clubbed together in a single job.
          Status Open [ 1 ] Patch Available [ 10002 ]
          Chris Douglas added a comment -

          Very basic question: what are the semantics of a mapper in this chain calling collect(K,V)? Currently, it is guaranteed that neither the key nor the value will be modified, so the following must hold:

          key.set(some_value);
          value.set(some_other_value);
          collect(key, value);
          assert key.get().equals(some_value);
          assert value.get().equals(some_other_value);
          

          Chaining mappers can violate this property unless the following maps guarantee (by convention, presumably) that they will not modify either argument. It might make sense to require chained mappers (excluding the final mapper) to implement a different interface, even if that interface is empty, to promise to treat the record as const.
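
          To make the hazard concrete, here is a minimal hypothetical sketch (the class name is invented for illustration) of a downstream chained mapper that mutates the key it receives; if records are passed by reference, this silently breaks the upstream mapper's assumption:

          import java.io.IOException;

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapred.MapReduceBase;
          import org.apache.hadoop.mapred.Mapper;
          import org.apache.hadoop.mapred.OutputCollector;
          import org.apache.hadoop.mapred.Reporter;

          // Hypothetical downstream mapper in a chain. key.set(...) below also
          // mutates the object still held by the upstream mapper, violating the
          // collect(K,V) guarantee quoted above.
          public class UppercaseKeyMap extends MapReduceBase
              implements Mapper<Text, Text, Text, Text> {

            public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
              key.set(key.toString().toUpperCase()); // mutates the shared instance
              output.collect(key, value);
            }
          }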

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385705/patch3702.txt
          against trunk revision 676069.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/console

          This message is automatically generated.

          Runping Qi added a comment -

          I don't understand why the map chaining is necessary.
          Why not just composite all the functions into one and use the composite as the mapper or reducer?

          Alejandro Abdelnur added a comment -
          • On Chris' comment:

          You are right; I missed that point in my proposed implementation.

          To address it, the Chain code should take care of cloning the key and value before passing them to the following Map in the chain.

          Still, as an optimization (to avoid serializing/deserializing keys and values), a passByReference property could be set for every link in the chain.

          • On Runping's comment:

          In our current implementation we do as you suggest: we have our own set of interfaces for processing (not Mappers and Reducers), and a ChainMapper and a ChainReducer that manage the lifecycle of those private interfaces.

          Using the Mapper/Reducer interfaces directly is cleaner, more consistent, and simpler for developers.

          Alejandro Abdelnur added a comment -

          Fixing test and style errors.

          Still not addressing by-value semantics on the chain.

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385952 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385952 ]
          Alejandro Abdelnur added a comment -

          Fixing test and style errors.

          Still missing: pass-by-value semantics.

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385953 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385953/patch3702.txt
          against trunk revision 676069.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/console

          This message is automatically generated.

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

          Refactoring the code to add pass-by-value support in the chain.

          Everything is in place except the ser/deser implementation, the ChainOutputCollector.passByValue() method (coming soon).

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12385965 ]
          Alejandro Abdelnur added a comment -

          Another thing missing in the current patch is specifying the key/value in/out classes for the maps and reduce in the chain and making sure they are compatible between the elements of the chain.

          Alejandro Abdelnur added a comment -

          Added byValue support to preserve the semantics of the collector not modifying the key and values.

          byReference can still be used as an optimization if the mappers and reducer in the chain don't rely on the byValue semantics.

          Improved the patch to allow different key/value classes to be used by different elements in the chain; the only restriction is that the output of a mapper has to match the expected input of the next mapper.

          Sample usage:

          
            JobConf conf = new JobConf();
            conf.setJobName("chain");
            
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            
            FileInputFormat.setInputPaths(conf, inDir);
            FileOutputFormat.setOutputPath(conf, outDir);
          
            JobConf mapAConf = new JobConf();
            ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
           
            JobConf mapBConf = new JobConf();
            ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class, LongWritable.class, Text.class, false, mapBConf);
           
            JobConf reduceConf = new JobConf();
            ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class, Text.class, Text.class, true, reduceConf);
           
            ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class, LongWritable.class, Text.class, false, null);
           
            ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class, LongWritable.class, LongWritable.class, true, null);
             ...
           
            JobClient jc = new JobClient(conf);
            RunningJob job = jc.submitJob(conf);
          

          Note: the second-to-last boolean parameter indicates whether the Mapper/Reducer added to the chain wants byValue (true) or byReference (false) semantics.
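
          For illustration, the by-value copy could be implemented roughly as in the sketch below, using Writable serialization; the helper name and placement are assumptions, not the patch's actual code:

          import java.io.IOException;

          import org.apache.hadoop.io.DataInputBuffer;
          import org.apache.hadoop.io.DataOutputBuffer;
          import org.apache.hadoop.io.Writable;
          import org.apache.hadoop.util.ReflectionUtils;

          // Sketch: clone a Writable by serializing it into a buffer and reading
          // it back into a fresh instance, so a downstream mapper can mutate its
          // copy without affecting the caller's object.
          public class WritableCloner {

            public static <T extends Writable> T cloneByValue(T obj) throws IOException {
              DataOutputBuffer out = new DataOutputBuffer();
              obj.write(out);
              DataInputBuffer in = new DataInputBuffer();
              in.reset(out.getData(), out.getLength());
              @SuppressWarnings("unchecked")
              T copy = (T) ReflectionUtils.newInstance(obj.getClass(), null);
              copy.readFields(in);
              return copy;
            }
          }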

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386160 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

          Using JobConf(false) for the chain mappers' and reducer's JobConfs when created internally, to reduce the size of the chain job's JobConf.

          Reusing the ByteArrayOutputStream used for byValue passing, with a ThreadLocal so it works with the MultithreadedMapRunner.
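
          A rough sketch of that per-thread reuse pattern (class and field names invented for illustration): each thread gets its own serialization buffer, so concurrent maps never share one.

          import org.apache.hadoop.io.DataOutputBuffer;

          // Sketch: one buffer per thread, reset before each use, so the
          // MultithreadedMapRunner's worker threads never clobber each other.
          public class PerThreadBuffers {

            private static final ThreadLocal<DataOutputBuffer> BUFFER =
                new ThreadLocal<DataOutputBuffer>() {
                  protected DataOutputBuffer initialValue() {
                    return new DataOutputBuffer();
                  }
                };

            public static DataOutputBuffer get() {
              DataOutputBuffer buffer = BUFFER.get();
              buffer.reset(); // reuse the backing array, discard old contents
              return buffer;
            }
          }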

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386208 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is a reduction in disk I/O as many Maps can be clubbed together in a single job.
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386208/patch3702.txt
          against trunk revision 677379.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          I don't like that each stage in the pipeline has its own configuration and that you serialize them all and put them into the outer configuration. What is the use case for that? If you are going to do that, wouldn't it be easier to just make Configuration Writable and use the writable stringifier?

          You should use hadoop.io.Data{In,Out}putBuffer rather than defining your own DirectBufferByteArrayOutputStream, especially since the name sounds like a direct buffer.

          You should probably convert the task id string to a TaskAttemptID and call the isMap method rather than parsing the taskid string.

          The preferred style is Sun's:

          if (...) {
          } else {
          }
          

          Other than that, it seems good.
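
          To illustrate the TaskAttemptID suggestion above, a minimal sketch (obtaining the id via the "mapred.task.id" property is an assumption about how the patch would look it up):

          import org.apache.hadoop.mapred.JobConf;
          import org.apache.hadoop.mapred.TaskAttemptID;

          // Sketch: recover the task attempt id from the configuration and ask
          // it whether this is a map task, instead of parsing the raw string.
          public class TaskKind {

            public static boolean isMapTask(JobConf conf) {
              TaskAttemptID taskId = TaskAttemptID.forName(conf.get("mapred.task.id"));
              return taskId.isMap();
            }
          }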

          Alejandro Abdelnur added a comment -

          Thanks for the feedback.

          On "each stage in the pipeline having its own configuration .. put them into the outer configuration":

          The use case is that the mappers/reducer in the chain (I'm not using 'pipe' to avoid confusion, as it has another use in Hadoop) are not aware of each other or that they are in a chain, and they may have collisions in the configuration keys they use.

          On "If you are going to do that, ... make Configuration Writable and use the writable stringifier?":

          That was my initial idea, but it would have required changes in Configuration and I was avoiding any change in the core.

          I'll make this change.

          *On "use {{hadoop.io.Data

          {In,Out}

          putBuffer}} ...":*

          I'll look at it.

          On "You should probably convert the task id string to a TaskAttemptID ...":

          OK

          On "preferred sytle is Sun's":

          OK

          Alejandro Abdelnur added a comment -

          Integrating Owen's comments:

          • Modified Configuration to be a Writable
          • Modified WritableUtils to have methods to convert Writable to/from String
          • Modified Chain to use the above methods instead of the hack with Properties
          • Using hadoop DataOutputBuffer
          • Using Sun's if/else style

          I could not figure out the comment on TaskAttemptID, as the patch is not using the task id at all to find out whether it is a map or reduce task. The ChainMapper and ChainReducer create a Chain to delegate common logic to, and at creation time they indicate what it is used for.

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386712 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386712 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386713 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

          Inverted precedence of configuration.

          The chain element JobConf configuration has precedence over the chain job JobConf.

          This allows setting defaults at the chain job JobConf level and overriding them at the chain element level.

          Note that this does not break anything at runtime, as all the injected configuration for the task is used by the Chain{Mapper,Reducer}.
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386720 ]
          Alejandro Abdelnur made changes -
          Assignee Alejandro Abdelnur [ tucu00 ] Christophe Taton [ kryzthov ]
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386720/patch3702.txt
          against trunk revision 679202.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          There is an ambiguity with Configuration.write due to making Configuration implement Writable.

          The existing Configuration.write(OutputStream) that writes out XML must be renamed to something like writeXml(OutputStream).

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Assignee Christophe Taton [ kryzthov ] Alejandro Abdelnur [ tucu00 ]
          Alejandro Abdelnur added a comment -

          Renaming the original Configuration.write(OutputStream) to Configuration.writeXml(OutputStream) to avoid ambiguity with the write(DataOutput) introduced by implementing Writable. The ambiguity occurs when using a DataOutputStream (this happens in JobHistory, JobClient, TaskTracker, TaskRunner).

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386782 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386782/patch3702.txt
          against trunk revision 679506.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          TestPipes is failing due to a signature change in Configuration, from write(OutputStream) to writeXml(OutputStream).

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

          Fixing TestPipes test case.

          By making Configuration implement Writable, there is an ambiguity between the write(OutputStream) that already existed in Configuration and the write(DataOutput) that is brought in by implementing Writable. The ambiguity arises when using a DataOutputStream; to resolve this I've renamed the original write(OutputStream) to writeXml(OutputStream).
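
          To see why the call is ambiguous, here is a minimal self-contained sketch (a hypothetical stand-in class, not Configuration itself): DataOutputStream both extends OutputStream and implements DataOutput, so neither overload is more specific and the compiler rejects the call.

          import java.io.DataOutput;
          import java.io.DataOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;

          // Minimal stand-in with the two overloads Configuration would have.
          class Conf {
            public void write(OutputStream out) throws IOException { /* XML form */ }
            public void write(DataOutput out) throws IOException { /* Writable form */ }
            public void writeXml(OutputStream out) throws IOException { /* XML form, renamed */ }

            public static void main(String[] args) throws IOException {
              DataOutputStream dos = new DataOutputStream(System.out);
              // new Conf().write(dos);  // does not compile: reference to write is ambiguous
              new Conf().writeXml(dos);  // renaming one overload resolves it
            }
          }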

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12386850 ]
          Alejandro Abdelnur made changes -
          Release Note The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is a reduction in disk I/O as many Maps can be clubbed together in a single job.

          The Configuration write(OutputStream) method has been renamed to writeXml(OutputStream) to avoid ambiguity with the Writable write(DataOutput) method.
          Hadoop Flags [Incompatible change]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386850/patch3702.txt
          against trunk revision 679772.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/console

          This message is automatically generated.

          Chris Douglas added a comment -
          • Instead of adding WritableUtils::asString, it might make more sense to use the o.a.h.io.Stringifier interfaces, particularly since you create a new JobConf (permitting a user to pass in an object would be an excellent, but not strictly necessary, addition to Stringifier, too)
          • I'm not sure I understand how the configuration deserialization works. In Chain::getChainElementConf, a new config is created, its fields cleared and populated from the stream, then each property defined in the deserialized conf is (re)defined on a clone of the JobConf passed in (presumably to permit final, etc. to be observed). Is that accurate?
          • Is it true that each call to map creates a series of ChainOutputCollectors? These look like lightweight objects, but is there a reason the pipeline needs to be recreated each time?
          • It looks like this broke the eclipse plugin:

            [exec] compile:
            [exec] [echo] contrib: eclipse-plugin
            [exec] [javac] Compiling 30 source files to /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/eclipse-plugin/classes
            [exec] [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/contrib/eclipse-plugin/src/java/org/apache/hadoop/eclipse/server/HadoopServer.java:422: write(java.io.DataOutput) in org.apache.hadoop.conf.Configuration cannot be applied to (java.io.FileOutputStream)
            [exec] [javac] this.conf.write(fos);
            [exec] [javac] ^
            [exec] [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/contrib/eclipse-plugin/src/java/org/apache/hadoop/eclipse/servers/RunOnHadoopWizard.java:166: write(java.io.DataOutput) in org.apache.hadoop.conf.Configuration cannot be applied to (java.io.FileOutputStream)
            [exec] [javac] conf.write(fos);
            [exec] [javac] ^

          Alejandro Abdelnur added a comment -

          On the first bullet item: finally got the Stringifier thing, thanks.

          On the second bullet item: the serialized chain element JobConf (A) is deserialized; a new JobConf (B) is created with the task's JobConf as parent (to ensure all Hadoop runtime values are available); then the values of A are applied to B, so the chain element's config values take precedence when the chain mapper/reducer gets its conf, B.

          On the third bullet item: yes, you are right. Two reasons: one is to keep the code simpler; the second is that if a MultithreadedMapRunner is used, a single pipeline will not work as it is not thread safe.

          On the fourth bullet item: this will go away as I'm using Stringifier in the next version of the patch.
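
          A hedged sketch of the A/B precedence step just described (the method name echoes Chris's question above; the body is an illustration of the described behavior, not the patch's code, and it assumes a Configuration serialization is registered for DefaultStringifier):

          import java.io.IOException;
          import java.util.Map;

          import org.apache.hadoop.io.DefaultStringifier;
          import org.apache.hadoop.mapred.JobConf;

          public class ChainConfs {

            // "A" is the chain element conf serialized into the job at submit time;
            // "B" starts as a copy of the task conf so runtime values are present,
            // then A's entries are applied so the element's values take precedence.
            public static JobConf getChainElementConf(JobConf taskConf, String confKey)
                throws IOException {
              JobConf a = DefaultStringifier.load(taskConf, confKey, JobConf.class);
              JobConf b = new JobConf(taskConf);
              for (Map.Entry<String, String> entry : a) {
                b.set(entry.getKey(), entry.getValue());
              }
              return b;
            }
          }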

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12387102 ]
          Alejandro Abdelnur made changes -
          Hadoop Flags [Incompatible change]
          Release Note [removed the sentence about the Configuration write(OutputStream) rename] The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is a reduction in disk I/O as many Maps can be clubbed together in a single job.
          Alejandro Abdelnur added a comment -

          Modified to use Stringifier as suggested; added a Serialization implementation for Configuration for that.

          This removes the 'incompatible change' introduced in the previous patch.
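
          For illustration, storing and loading a chain element's JobConf through the Stringifier mechanism might look like the sketch below; the key name is invented, and it assumes the patch's Serialization for Configuration is registered under io.serializations:

          import java.io.IOException;

          import org.apache.hadoop.io.DefaultStringifier;
          import org.apache.hadoop.mapred.JobConf;

          public class StoreElementConf {

            public static void main(String[] args) throws IOException {
              JobConf chainJob = new JobConf();
              JobConf elementConf = new JobConf(false); // small conf, no default resources
              elementConf.set("a", "A");

              // Serialize the element conf into a string property of the chain job...
              DefaultStringifier.store(chainJob, elementConf, "chain.mapper.0.conf");

              // ...and read it back at task time.
              JobConf restored =
                  DefaultStringifier.load(chainJob, "chain.mapper.0.conf", JobConf.class);
              System.out.println(restored.get("a")); // prints A
            }
          }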

          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387102/patch3702.txt
          against trunk revision 680577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/console

          This message is automatically generated.

          Enis Soztutar added a comment -
          1. Having a distinct serializer for Configuration is not desired. Configuration is already marshaled/unmarshaled to/from string form as XML. So having Configuration implement Writable is the choice here. There is no ambiguity in the Configuration.write() method. We can keep the current write(OutputStream out) method and add:
            public void write(final DataOutput out) throws IOException {
                write(new OutputStream() {
                  @Override
                  public void write(int b) throws IOException {
                    out.writeByte(b);
                  }
                });
              }
            

            The readFields() can be implemented by factoring common functionality of loadProperties() method, instead of reading from a URL or file, the XML will be built using DataInput.

2. It would be better for ChainMapper, ChainReducer, and ChainOutputCollector to use generics, similar to the Mapper, Reducer, and OutputCollector classes.
3. Creating a new ChainOutputCollector() at each collect() call might be suboptimal. Could this not be done once for each Map/Reduce task?
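
A minimal sketch of the readFields() idea from point 1. This assumes the serialized XML bytes are length-prefixed (the write() sketch above writes raw bytes, so the two sides would need to agree on framing) and assumes a hypothetical helper loadResource(InputStream), factored out of loadProperties(), that parses configuration XML from a stream:

  public void readFields(DataInput in) throws IOException {
    // Assumes the writer length-prefixed the XML bytes.
    byte[] xml = new byte[in.readInt()];
    in.readFully(xml);
    // loadResource(...) is a hypothetical helper factored out of
    // loadProperties(); it parses configuration XML from the stream.
    loadResource(new ByteArrayInputStream(xml));
  }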
          Enis Soztutar made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Enis Soztutar made changes -
          Priority Minor [ 4 ] Major [ 3 ]
          Alejandro Abdelnur added a comment -

On #1, the ambiguity between the existing write(OutputStream) and the write(DataOutput) introduced by the Writable interface arises when a DataOutputStream object is passed to the write(...) method: the compiler cannot resolve which overload to use. (I did this in one of the previous patches; my solution was changing the existing write() to writeXml(), but that started breaking in different places in contrib, as the method is used outside of core.)
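
For illustration, a minimal standalone example of the overload ambiguity (Conf is a hypothetical stand-in for Configuration). DataOutputStream extends OutputStream and implements DataOutput, and since the two parameter types are unrelated, neither overload is more specific:

  import java.io.*;

  class Conf {
    public void write(OutputStream out) throws IOException { /* existing XML path */ }
    public void write(DataOutput out) throws IOException { /* Writable path */ }
  }

  class Caller {
    void demo(Conf conf, DataOutputStream dos) throws IOException {
      // conf.write(dos);             // does not compile: "reference to write is ambiguous"
      conf.write((OutputStream) dos); // an explicit cast resolves the ambiguity
    }
  }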

On #2, it doesn't make much sense to generify the Chain* classes as they are not directly exposed to the M/R developer; they are artifacts used to enable chaining, and the M/R developer doesn't code against them. If they were generified it would not add any value, as nothing could be checked for them at compile time.

On #3, ChainOutputCollector is a lightweight class; see my answer to Chris regarding a similar question.

          Enis Soztutar added a comment -
1. Obviously, casting the DataOutputStream to OutputStream would resolve the ambiguity. All such references could be changed (if any) to first cast the reference:

  DataOutputStream dos = ...
  ...
  conf.write((OutputStream) dos);
  ...

2. Although the Chain classes are not exposed, it is always good practice to use generics. We can get rid of the @SuppressWarnings("unchecked") annotations.

3. If you are confident that this does not introduce extra overhead, I'm fine with this.
          Alejandro Abdelnur added a comment -

On #1, yes, but this introduces a backwards incompatibility, as there is code out there (outside of core) that uses write(OutputStream) with DataOutputStream instances, and the change would break such usages (as the previous patch found).

On #2, I'm missing something here: under what circumstances would it make sense to use the Chain* classes with generics that would be checked at compile time? If it is just a way of avoiding the @SuppressWarnings annotations, I'd prefer the annotations as, IMO, they are meant for these cases.

          Enis Soztutar added a comment -

On #1, yes, but this introduces a backwards incompatibility, as there is code out there (outside of core) that uses write(OutputStream) with DataOutputStream instances, and the change would break such usages (as the previous patch found).

Yes, at some level it will introduce backwards incompatibility. But thinking in abstract terms, having func(Interface1 i1) and introducing func(Interface2 i2) is not a direct incompatibility. Only those calls where the object implements both Interface1 and Interface2 would be affected (in this case DataOutputStream), and there is a clear workaround for this. I think introducing a Serialization for Configuration is far worse. I suggest we keep write(OutputStream), introduce write(DataOutput), fix all the cases where a DataOutputStream is passed (including contrib), and mark this change as incompatible with clear documentation in the release note.

On #2, I'm missing something here: under what circumstances would it make sense to use the Chain* classes with generics that would be checked at compile time? If it is just a way of avoiding the @SuppressWarnings annotations, I'd prefer the annotations as, IMO, they are meant for these cases.

@SuppressWarnings annotations are just "hacks" to make the compiler stop complaining. There are valid reasons for the compiler to issue warnings, and instead of fixing them we tell the compiler to ignore them, which is not desirable.
The motivation to use generics is the same as the one for using generics in Mapper, Reducer, etc. I guess with a little extra effort we could make this change, no?

          Alejandro Abdelnur added a comment -

On #1, I don't have a problem either way; I was just trying to be as minimally disruptive as possible. Somebody has to make the call here.

On #2, I've tried to generify ChainOutputCollector but I'm hitting a wall there: I never have the types to instantiate it, so I keep getting the compiler warnings. Thoughts?

          Alejandro Abdelnur added a comment -

          Enis,

          I have not heard anything back from you on this.

          Devaraj Das added a comment -

I think introducing a Serialization for Configuration is far worse.

Enis, could you please expand on this one? It is not clear to me how this is worse.

Regarding the generics signature for the Chain* classes, I think it is okay not to have generics, since these classes' key/value types are driven by the key/value types of the first/last map/reduce in the chain.

          Enis Soztutar added a comment -

          I am attaching a patch which is a modified version of the last one.

I have added generic arguments where applicable. But not all the instances could be generified, since we cannot know the type arguments: we do not know the map input and output types after the first map and before the last map. At least ChainReducer and ChainMapper are now generified.

Implementing a Serializer for a specific class does not fit into the design of Serializer. Serializer is intended to be defined once for each set of classes (those implementing Writable, Serializable, etc.).
In the attached patch, WritableJobConf extends JobConf and implements Writable. Chain uses this class to serialize/deserialize the configuration. And this does not introduce an incompatibility, since Configuration/JobConf does not implement Writable.
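
A minimal sketch of what such a WritableJobConf could look like, reusing the existing Configuration.write(OutputStream) XML serialization and assuming addResource(InputStream) can load configuration XML from a stream; the actual patch may differ:

  public class WritableJobConf extends JobConf implements Writable {

    public void write(DataOutput out) throws IOException {
      // Reuse the existing XML serialization, length-prefixing the bytes
      // so readFields() knows where this configuration ends.
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      write(baos);
      byte[] xml = baos.toByteArray();
      out.writeInt(xml.length);
      out.write(xml);
    }

    public void readFields(DataInput in) throws IOException {
      byte[] xml = new byte[in.readInt()];
      in.readFully(xml);
      addResource(new ByteArrayInputStream(xml));
    }
  }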

          Enis Soztutar made changes -
          Attachment Hadoop-3702.patch [ 12388256 ]
          Enis Soztutar made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Enis Soztutar made changes -
          Link This issue incorporates HADOOP-3927 [ HADOOP-3927 ]
          Alejandro Abdelnur added a comment -

          [apologies for the delay following up on this, I was off all last week]

          On using generics

          Enis, I don't think the use of generics in your proposed patch is correct. Let me try to explain.

          First reason:

          The intended normal use of ChainMapper and ChainReducer is via configuration, i.e.:

  JobConf conf = ...
  ChainMapper.addMapper(conf, ...);
  ChainMapper.addMapper(conf, ...);
  ChainMapper.addMapper(conf, ...);
  ...

Making ChainMapper<K1, V1, K2, V2> does not make sense in this case, as a developer never instantiates it. Thus no type checking is done at compile time for K1, V1, K2, V2.

          Second reason:

Even if you were to create an instance of ChainMapper bound to concrete classes for K1, V1, K2, V2, it would not work for the common case of mappers in the chain using different key/value classes.

See, the contract between the maps in a chain is that the input key/value classes of the first mapper are the same as the input key/value classes of the job, the input key/value classes of the second mapper are the same as the output key/value classes of the first mapper, and so on, and the output key/value classes of the last mapper in the chain (for the ChainMapper) are the same as the input key/value classes of the reducer.

For example, take a job whose map input/output classes are K1, V1, K2, V2; you can have the following chain:

  JobConf conf = ...
  ChainMapper.addMapper(conf, AMap.class, K1.class, V1.class, Ka.class, Va.class, null);
  ChainMapper.addMapper(conf, BMap.class, Ka.class, Va.class, Kb.class, Vb.class, null);
  ChainMapper.addMapper(conf, CMap.class, Kb.class, Vb.class, K2.class, V2.class, null);

          On using a Serializer for Configuration

Note that the Serializer is for Configuration and its subclasses; it is not bound to Configuration alone.

          I'm OK with your proposed patch here.

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

This patch uses Enis' WritableJobConf instead of making a Serialization for Configuration.

          But it does not generify the Chain classes (per previous comment).

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389067 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Enis Soztutar added a comment -

Making ChainMapper<K1, V1, K2, V2> does not make sense in this case, as a developer never instantiates it. Thus no type checking is done at compile time for K1, V1, K2, V2.

It does not matter whether the classes are instantiated by the user or not; it is better to have them use generics properly as much as we can, so that we can get rid of the @SuppressWarnings("unchecked"). IdentityMapper is also never instantiated by the user.

Even if you were to create an instance of ChainMapper bound to concrete classes for K1, V1, K2, V2, it would not work for the common case of mappers in the chain using different key/value classes. See, the contract between the maps in a chain is that the input key/value classes of the first mapper are the same as the input key/value classes of the job, the input key/value classes of the second mapper are the same as the output key/value classes of the first mapper, and so on, and the output key/value classes of the last mapper in the chain (for the ChainMapper) are the same as the input key/value classes of the reducer.

Yes, I know that the whole dataflow would be

  (K1, V1) -> (Ka, Va) -> ... -> (K2, V2) -> (K3, V3)

where the first mapper takes K1, V1 and the last mapper outputs K2, V2, and the patch checks for that.
For example, Chain#mappers being Mapper<?,?,?,?> does not enforce that the mappers use the same input/output classes, and ChainOutputCollector uses <K,V,KO,VO> so that its input and output classes are enforced (but unfortunately we cannot instantiate ChainOutputCollectors with bound parameters).
Chain#addMapper() is static, so its generic <K1,V1,K2,V2> parameters are not bound by the Chain<K1,...V3> parameters. So you can again use

  JobConf conf = ...
  ChainMapper.addMapper(conf, AMap.class, K1.class, V1.class, Ka.class, Va.class, null);
  ChainMapper.addMapper(conf, BMap.class, Ka.class, Va.class, Kb.class, Vb.class, null);
  ChainMapper.addMapper(conf, CMap.class, Kb.class, Vb.class, K2.class, V2.class, null);

My version of the patch uses a similar approach to JobConf, in that we try to use generics as much as possible, which, I think, should be the preferred way.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389067/patch3702.txt
          against trunk revision 689733.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

The failed tests seem unrelated to this patch.

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

I've got Enis' point about making use of generics in the addMapper and setReducer methods of the Chain* classes. This enforces a compile-time type check between the defined key/value in/out classes and the mapper/reducer classes when defining the chain. I've modified the patch to do so.

          An additional question on this. Currently the addMapper and setReducer signatures are like:

  public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<K1, V1, K2, V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
          

Wouldn't it be more appropriate to make them:

  public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<? extends K1, ? extends V1, ? extends K2, ? extends V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
          

Still, I don't agree on making the Chain* classes generic. Their generics would not be used for any type checking at compilation, and it is not possible to get rid of the @SuppressWarnings("unchecked").
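
For illustration, a sketch of why the unchecked cast seems unavoidable when the chain instantiates mappers from configured Class objects (newMapper is a hypothetical helper, not the patch's code):

  // The concrete key/value types of each mapper in the chain are only
  // known at runtime (from the JobConf), so the cast cannot be checked
  // at compile time.
  @SuppressWarnings("unchecked")
  static Mapper<Object, Object, Object, Object> newMapper(
      JobConf conf, Class<? extends Mapper> klass) {
    return (Mapper<Object, Object, Object, Object>)
        ReflectionUtils.newInstance(klass, conf);
  }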

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389247 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.19.0 [ 12313211 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389247/patch3702.txt
          against trunk revision 691055.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/console

          This message is automatically generated.

          Enis Soztutar added a comment -

Still, I don't agree on making the Chain* classes generic. Their generics would not be used for any type checking at compilation, and it is not possible to get rid of the @SuppressWarnings("unchecked").

          I do not wish to further delay the patch for this. I'm OK with the current patch, although the reasons for using generics remain.

Wouldn't it be more appropriate to make them?...

Sigh. Yes, since in generics, G<Foo> is not a subtype of G<Bar> even if Foo extends Bar. So please make that change.
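
For illustration, a minimal standalone example of this invariance (Foo and Bar are hypothetical classes):

  import java.util.ArrayList;
  import java.util.List;

  class Bar { }
  class Foo extends Bar { }

  class Demo {
    void invariance() {
      List<Foo> foos = new ArrayList<Foo>();
      // List<Bar> bars = foos;        // does not compile: List<Foo> is not a List<Bar>
      List<? extends Bar> bars = foos; // the wildcard accepts a List of any Bar subtype
    }
  }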

          Alejandro Abdelnur added a comment -

          Thanks Enis,

On the use of generics in addMapper/setReducer, which one of the three options is the most appropriate? I would be inclined to think the third one. My reasoning is that on the input side the mapper/reducer must handle at least a subclass of the key/value classes, and on the output side the mapper/reducer defines what the key/value classes are.

          Option 1

public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<? extends K1, ? extends V1, ? extends K2, ? extends V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
          

          Option 2

          public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                                     Class<? extends Mapper<K1, V1, K2, V2>> klass,
                                     Class<? extends K1> inputKeyClass, Class<? extends V1> inputValueClass,
                                     Class<? extends K2> outputKeyClass, Class<? extends V2> outputValueClass,
                                     boolean byValue, JobConf mapperConf) {
          

          Option 3

          public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                                     Class<? extends Mapper<K1, V1, K2, V2>> klass,
                                     Class<? extends K1> inputKeyClass, Class<? extends V1> inputValueClass,
                                     Class<K2> outputKeyClass, Class<V2> outputValueClass,
                                     boolean byValue, JobConf mapperConf) {
          
          Enis Soztutar added a comment -

Let's see:
if we have

  class B extends A

then Mapper<A, X, X, X> can accept Class<B> as the inputKeyClass, but not the other way around, right?
In this case options 1 and 3 are not correct.
Option 2 is good, I guess. And there is no need to limit the output key/value types (the input key class can be the same as the output key class).

          Alejandro Abdelnur added a comment -

I'm confused as to why option 3 would not work with your example.

My reasoning for option 3 is that on the input side the mapper/reducer can process a less specific class than the declared input key/value classes, but the output cannot be less (or more) specific, because the concrete output class is what is used to obtain the serializer.

Classes: A, B extends A, C extends B

Input Key/Value classes: C, B
Output Key/Value classes: A, B

Valid mapper classes: Mapper<C, B, A, B>, Mapper<C, A, A, B>, Mapper<B, B, A, B>, Mapper<B, A, A, B>, Mapper<A, B, A, B>, Mapper<A, A, A, B>

Invalid mapper classes: Mapper<*, *, B, B> (and any other mapper where the output is not A, B)

          Enis Soztutar added a comment -

OK, I mistook option 3 for a derivative of option 1, when it is clearly a derivative of option 2.
However, the output classes can indeed be more specific than declared. An example of this is a mapper which does not change the input key and directly outputs it.

The following code fragment demonstrates that we should go with option 2.

            static class A  {
            }
            static class B extends A {
            }
            static class AMapper extends MapReduceBase implements Mapper<A, Text, A, Text> {
              @Override
              public void map(A key, Text value, OutputCollector<A, Text> output,
                  Reporter reporter) throws IOException {
              }
            }
            static class BMapper extends MapReduceBase implements Mapper<B, Text, B, Text> {
              @Override
              public void map(B key, Text value, OutputCollector<B, Text> output,
                  Reporter reporter) throws IOException {
              }
            }
            static {
              JobConf job = new JobConf();
              addMapper(true, job, AMapper.class, A.class, Text.class, A.class, Text.class, false, job);
              addMapper(true, job, AMapper.class, B.class, Text.class, B.class, Text.class, false, job);
              addMapper(true, job, BMapper.class, A.class, Text.class, A.class, Text.class, false, job);//should fail
              addMapper(true, job, BMapper.class, B.class, Text.class, B.class, Text.class, false, job);
            }
          
          Alejandro Abdelnur added a comment -

With option 3, the //should fail line fails to compile.

Your comment "However the output classes can indeed be more specific than declared. An example of this is a mapper which does not change the input key and directly outputs it." is correct; nasty, but correct.

This then brings up another question: currently, when adding a mapper to the chain, the addMapper method checks that the input key/value classes match (are equal, a ==) the output key/value classes of the previous mapper. Should this be changed to an isAssignable() check? Or is that too much?
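
For illustration, a sketch of what the relaxed check could look like (validateKeyClasses is a hypothetical helper, not the patch's actual code):

  // The previous mapper's output key class must be assignable to the
  // next mapper's input key class (previously checked with ==).
  static void validateKeyClasses(Class<?> prevOutputKeyClass,
                                 Class<?> nextInputKeyClass) {
    if (!nextInputKeyClass.isAssignableFrom(prevOutputKeyClass)) {
      throw new IllegalArgumentException(
          "Output key class " + prevOutputKeyClass.getName() +
          " is not assignable to input key class " +
          nextInputKeyClass.getName());
    }
  }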

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Alejandro Abdelnur added a comment -

Changing the signature of the addMapper and setReducer methods to use ? extends T for the input/output key/value classes.

Changing the check of chained input/output key/value classes to use isAssignable instead of equals.

          Alejandro Abdelnur made changes -
          Attachment patch3702.txt [ 12389651 ]
          Alejandro Abdelnur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389651/patch3702.txt
          against trunk revision 692700.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/console

          This message is automatically generated.

          Enis Soztutar added a comment -

          +1

          Devaraj Das added a comment -

          I just committed this. Thanks, Alejandro!

          Devaraj Das made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hadoop Flags [Reviewed]
Hudson added a comment -

Integrated in Hadoop-trunk #600 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/600/ )
          Robert Chansler made changes -
Release Note The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is a reduction in disk I/O, as many Maps can be clubbed together in a single job.
          Introduced ChainMapper and the ChainReducer classes to allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ REDUCE MAP*.
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]

People

• Assignee: Alejandro Abdelnur
• Reporter: Alejandro Abdelnur
• Votes: 0
• Watchers: 5
