HADOOP-3149 (Hadoop Common)

supporting multiple outputs for M/R jobs

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Environment: all

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Introduced the MultipleOutputs class so Map/Reduce jobs can write data to different output files. Each output can use a different OutputFormat. Output files are created within the job output directory. FileOutputFormat.getPathForCustomFile() creates a filename under the output directory that is named with the task ID and task type (e.g. myfile-r-00001).

      Description

      The OutputCollector supports writing data to a single output, the 'part' files in the output path.

      We have found it quite common that our M/R jobs have to write data to different outputs, for example when classifying data as NEW, UPDATE, DELETE or NO-CHANGE to later do different processing on it.

      Handling the initialization of additional outputs from within the M/R code complicates the code and is counterintuitive to the notion of job configuration.

      It would be desirable to:

      1. Configure the additional outputs in the jobconf, potentially specifying different outputformats, key and value classes for each one.
      2. Write to the additional outputs in a similar way as data is written to the outputcollector.
      3. Support the speculative execution semantics for the output files, only visible in the final output for promoted tasks.
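Point 1 above — seeding named-output definitions into the job configuration — can be modeled with plain string properties, much the way a JobConf stores its settings. The sketch below is illustrative only; the class and property names are hypothetical, not the API proposed in the patch (the [a-zA-Z0-9] name restriction is taken from later discussion in this issue):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not the patch's API) of seeding named-output
// definitions into a flat string configuration, as point 1 proposes.
public class NamedOutputConf {
  private final Map<String, String> props = new HashMap<>();

  // Register a named output with its output format and key/value classes.
  public void addNamedOutput(String name, String outputFormat,
                             String keyClass, String valueClass) {
    // Names become file names, so restrict them to [a-zA-Z0-9].
    if (!name.matches("[a-zA-Z0-9]+")) {
      throw new IllegalArgumentException("invalid named output: " + name);
    }
    props.put("mo.namedOutput." + name + ".format", outputFormat);
    props.put("mo.namedOutput." + name + ".key", keyClass);
    props.put("mo.namedOutput." + name + ".value", valueClass);
  }

  public String getFormat(String name) {
    return props.get("mo.namedOutput." + name + ".format");
  }

  public String getKeyClass(String name) {
    return props.get("mo.namedOutput." + name + ".key");
  }
}
```

Storing each output under its own property prefix is what lets point 5 in the later discussion drop the MOJobConf class: a vanilla JobConf can already carry the definitions.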

      To support multiple outputs the following classes would be added to mapred/lib:

      • MOJobConf : extends JobConf adding methods to define named outputs (name, outputformat, key class, value class)
      • MOOutputCollector : extends OutputCollector adding a collect(String outputName, WritableComparable key, Writable value) method.
      • MOMapper and MOReducer: implement Mapper and Reducer, adding new configure, map and reduce signatures that take the corresponding MO classes and perform the proper initialization.

      The data flow behavior would be: key/values written to the default (unnamed) output (using the original OutputCollector collect signature) take part in the shuffle/sort/reduce processing phases; key/values written to a named output from within a map do not.

      The named output files would be named using the task type and task ID to avoid collision among tasks (i.e. 'new-m-00002' and 'new-r-00001').
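As a concrete illustration of that naming scheme, the leaf name can be composed from the named output, a task-type letter, and a zero-padded task ID. This helper is a sketch for illustration only, not code from the patch:

```java
// Sketch of the proposed collision-free naming scheme: named output,
// task type ('m' for map, 'r' for reduce), zero-padded task ID.
public class NamedOutputFileName {
  public static String leafName(String namedOutput, boolean isMapTask,
                                int taskId) {
    return String.format("%s-%s-%05d",
                         namedOutput, isMapTask ? "m" : "r", taskId);
  }
}
```

Because the task type and ID are baked into the file name, two tasks (or a task and its speculative duplicate) can never write to the same file.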

      Together with the setInputPathFilter feature introduced by HADOOP-2055, it would become very easy to chain jobs working on particular named outputs within a single directory.

      We use this pattern heavily and it has greatly simplified our M/R code as well as the chaining of different M/R jobs.

      We wanted to contribute this back to Hadoop as we think it is a generic feature many could benefit from.

      1. patch3149.txt
        32 kB
        Alejandro Abdelnur
      2. patch3149.txt
        32 kB
        Alejandro Abdelnur
      3. patch3149.txt
        32 kB
        Alejandro Abdelnur
      4. patch3149.txt
        32 kB
        Alejandro Abdelnur
      5. patch3149.txt
        33 kB
        Alejandro Abdelnur
      6. patch3149.txt
        33 kB
        Alejandro Abdelnur
      7. patch3149.txt
        19 kB
        Alejandro Abdelnur
      8. patch3149.txt
        19 kB
        Alejandro Abdelnur
      9. patch3149.txt
        19 kB
        Alejandro Abdelnur
      10. patch3149.txt
        20 kB
        Alejandro Abdelnur
      11. patch3149.txt
        19 kB
        Alejandro Abdelnur
      12. patch3149.txt
        19 kB
        Alejandro Abdelnur
      13. patch3149.txt
        22 kB
        Alejandro Abdelnur
      14. patch3149.txt
        33 kB
        Alejandro Abdelnur
      15. patch3149.txt
        33 kB
        Alejandro Abdelnur
      16. patch3149.txt
        29 kB
        Alejandro Abdelnur


          Activity

          Runping Qi added a comment -

          I think the multiple output format class in https://issues.apache.org/jira/browse/HADOOP-2906
          should fit your need well.
          With that class, you can write different key/value pairs to different output files, all under your control.

          Alejandro Abdelnur added a comment -

          Runping,

          The MultipleOutputFormat of HADOOP-2906 is more general than what HADOOP-3149 tries to solve, and because of that it requires more coding and handling from the Map/Reduce code: 1. the developer has to take care of avoiding file collisions; 2. the developer has to take care of closing the MultipleOutputFormat files; 3. it is not clear how to define the key/value classes for SequenceFiles.

          The MO* API makes it easy to define named outputs, each with its own configuration (output format, key class, value class), and to use them in the usual collector way. The MO code takes care of namespacing the file names, and of creating and closing them.

          Owen O'Malley added a comment -

          Alejandro, your patch is more complicated and less general than the already committed patch. I would suggest that you write a subclass of HADOOP-2906's MultipleOutputFormat.

          Owen O'Malley added a comment -

          Looking over your and Runping's patches, I'd suggest defining a subclass that looks like:

           package org.apache.hadoop.mapred.lib;

           public class KeyValue<K,V> {
             private K key;
             private V value;
             public KeyValue();
             public KeyValue(K key, V value);
             public K getKey();
             public V getValue();
             public void setKey(K k);
             public void setValue(V v);
           }

           public class MultipleOutputStreams extends MultipleOutputFormat {
             // modify job conf to control how to format a given stream;
             // should be called once for each stream kind
             public static void addOutputStream(JobConf conf,
                                                String kind,
                                                Class<? extends OutputFormat> outFormat,
                                                Class<?> keyClass,
                                                Class<?> valueClass);
           }
          

          So client code would look like:

          In launcher:
              MultipleOutputStreams.addOutputStream(job, "foo", SequenceFileOutputFormat.class, Text.class, IntWritable.class);
              MultipleOutputStreams.addOutputStream(job, "bar", TextOutputFormat.class, Text.class, Text.class);

           In reducer:
             out.collect("foo", new KeyValue(new Text("hi"), new IntWritable(12)));
             out.collect("bar", new KeyValue(k2, v2));
          
          Alejandro Abdelnur added a comment -

          Owen, let me try filling the gaps in your suggestion.

          1. MultipleOutputFormats extends MultipleOutputFormat providing a static method to define named outputs in the job conf including outputformat, key and value classes.
          2. MultipleOutputFormats will also implement the logic to create the per task file names based on the given named output.
          3. A MultipleOutputCollector will provide a collect(String namedOutput, WritableComparable key, Writable value) method.
           4. MultipleOutputMapper and MultipleOutputReducer classes will have map and reduce methods receiving a MultipleOutputCollector.
          5. There will not be a MOJobConf, #1 takes care of seeding the named output data into the vanilla JobConf.

          Follow up comments/questions:

          • I don't see the point of the KeyValue class as it will not be possible to leverage the generic type checking at compile time (that is why #3 signature differs from your suggestion).
          • It seems to me that MultipleOutputFormats would have to have its own subclasses for Text and SequenceFile to create the right RecordWriter within the getRecordWriter method as I could have named outputs that are TextOutputFormat or SequenceFileOutputFormat. And the MultipleOutputFormats would be restricted to handle those 2 output format types.

          Thoughts?

          Runping Qi added a comment -

          The key to use MultipleOutputFormat class is to define a subclass that implements

          abstract protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs,
                JobConf job, String name, Progressable arg3) throws IOException;
          

           MultipleSequenceFileOutputFormat and MultipleTextOutputFormat are two simple but commonly used subclasses.
           If you need to use different types of record writer classes (TextRecordWriter and SequenceFileRecordWriter) for different output files,
           you need to have a subclass implementing that logic in getBaseRecordWriter.
           You can use the name argument and/or any info in the JobConf argument to decide whether to create a TextRecordWriter or a SequenceFileRecordWriter. Since the signatures for TextRecordWriter and SequenceFileRecordWriter are not quite compatible,
           you may need to fiddle with the types a bit by type casting.
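The dispatch-and-cache behavior described here — an outer record writer that lazily creates and caches one base writer per leaf name — can be modeled without Hadoop types. This is a simplified sketch; SimpleWriter and the key-prefix routing are stand-ins for Hadoop's RecordWriter and the generateLeafFileName logic, not the real API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Stand-in for Hadoop's RecordWriter.
interface SimpleWriter {
  void write(String key, String value);
  void close();
}

// Models MultipleOutputFormat's inner writer: routes each record to a
// per-name base writer, creating and caching writers on first use.
class CachingMultiWriter implements SimpleWriter {
  private final Map<String, SimpleWriter> writers = new HashMap<>();
  private final Function<String, SimpleWriter> factory;

  CachingMultiWriter(Function<String, SimpleWriter> factory) {
    this.factory = factory;
  }

  // Here the leaf name is derived from a key prefix; the real code
  // derives it via generateLeafFileName/generateFileNameForKeyValue.
  public void write(String key, String value) {
    String leaf = key.split(":", 2)[0];
    writers.computeIfAbsent(leaf, factory).write(key, value);
  }

  public void close() {   // close every cached base writer
    for (SimpleWriter w : writers.values()) w.close();
    writers.clear();
  }
}
```

This caching inside a single outer writer is the behavior Alejandro later describes as "kind of too twisted": the framework only sees one RecordWriter, and the fan-out to multiple files happens inside it.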

          Alejandro Abdelnur added a comment -

          Got it.

           Implemented getBaseRecordWriter using a trick to make it work with arbitrary OutputFormats.

          I had to modify the getBaseRecordWriter signature to pass the leafName as parameter, to be able to obtain the corresponding OutputFormat/Key/Value classes.

          It took me a while to understand how recordWriters are cached within a single RecordWriter, IMO kind of too twisted.

          The current MultipleSequenceFileOutputFormat has a limitation, it only works with the same key and value classes as the standard job output.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379107/patch3149.txt
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac -1. The applied patch generated 524 javac compiler warnings (more than the trunk's current 521 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 2 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2121/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2121/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2121/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

           Corrections to indentation/style problems in new files.

           Ignoring indentation problems in modified files on lines the patch does not touch.


           On the FindBugs report on MultipleOutputTask:

           Regarding "IS2_INCONSISTENT_SYNC: Inconsistent synchronization": the non-synchronized access is in configure, the synchronized access is during writes. The variable in question is conf, which is used in read-only mode.


          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379128/patch3149.txt
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac -1. The applied patch generated 524 javac compiler warnings (more than the trunk's current 521 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2124/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2124/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2124/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2124/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          You've got to fix the findbugs warning and javadoc warnings.

          You also missed the point of the KeyValue class. The point was to make:

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(KeyValue.class);
          

          So that you can emit the same type out of the map/reduce. The multiple output format then generates the real key and value by pulling them out of the KeyValue object.
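The mechanism Owen describes — emitting a single wrapper type and letting the output format pull the real key and value back out — can be shown with a minimal generic pair class. This is a sketch filling in his earlier outline, not the class from any patch:

```java
// Minimal generic pair, per Owen's KeyValue sketch: the job declares
// KeyValue as its output value class, and the output format unwraps
// the real key and value at write time.
public class KeyValue<K, V> {
  private K key;
  private V value;

  public KeyValue() {}
  public KeyValue(K key, V value) { this.key = key; this.value = value; }

  public K getKey()         { return key; }
  public V getValue()       { return value; }
  public void setKey(K k)   { this.key = k; }
  public void setValue(V v) { this.value = v; }
}
```

The trade-off Alejandro raises still applies: because the declared job output types are fixed to the wrapper, the generic parameters give no compile-time checking of what each named output actually receives.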

          I don't see why you feel the need to limit the names to [a-zA-Z0-9]. Surely, it doesn't matter if they use underscore or period...

          There should not be multiple output task or multiple output collector classes...

          Alejandro Abdelnur added a comment - - edited

           FindBugs: fixed one; the second (the IS2 warning) is a false positive (explained in my previous comment).

           Fixing checkstyle warnings in the next version of the patch.

           • It is not always desirable for the OutputFormat/key/value classes of all outputs to be the same; we have several use cases where they are different (I've pointed this out as a limitation of Runping's patch).
           • Limiting the names to [a-zA-Z0-9] has a purpose: the names are used to create the files under the output dir, and you don't want funny characters in the leaf name.

          Now I understand what you have in mind with the KeyValue object and not having MultipleOutput* classes. I have the following comments on that:

           1. See the first bullet item above.
           2. Piggybacking on the current collector to write to multiple outputs, while it avoids introducing new classes, is not obvious and, I would say, confusing for the developer; I'd rather have explicit methods/classes for handling multiple outputs.

           I would rather refactor the code to:

           • Get rid of the MultipleOutput Task/Map/Reduce/Collector classes.
           • Not introduce a KeyValue class.

           Add a collect(String namedOutput, WritableComparable key, Writable value) method to MultipleOutputs, and the usage pattern would be:

          public class MyReducer implements Reducer {
            private MultipleOutputs mos;
          
            public void configure(JobConf conf) {
              mos = new MultipleOutputs(conf);
            }
          
            public void reduce(WritableComparable key, Iterator<Writable> values, OutputCollector collector, Reporter reporter) throws IOException {
              Writable value = values.next();
              collector.collect(key, value);
              mos.collect("aa", key, value);
            }
          
            public void close() throws IOException {
              mos.close();
            }
          
          }
          

           Configuration of the job prior to dispatch would remain the same.

          Alejandro Abdelnur added a comment -

          Modified patch as per my last comments.

          Runping Qi added a comment -

          Alejandro,

           Having a private mos in the mapper/reducer class, in addition to the standard collector passed in through the map/reduce call, is really ugly and redundant, and is not compatible with the common patterns of applications using the map/reduce framework.
           All the output dispatching logic should be handled by the class of the collector argument.
           That is exactly why the MultipleOutputFormat classes were introduced. You may think it is not obvious at first. But like many other aspects of the
           map/reduce framework, you will get used to it, and once you do, you'll find it very natural.

           I found it is most common that all the output key/values are of the same types.
          In case you need to write keys/values of different types, you may need to encapsulate them in a container class and use
          the container class as the key (value) class for the file. The MultipleOutputFormat allows you to write key/value of different types to different files.

          Alejandro Abdelnur added a comment -

          Runping,

          Yes, I've got it. Please look at my last patch.

           If the consensus is that MultipleOutputs should not be in Hadoop, then I would request the proposed signature change to the getBaseRecordWriter method, so it receives the original leafName.

           Thanks.

          Runping Qi added a comment -

          Alejandro,

           If you look at the code carefully, you will notice that the value passed to the name argument of the getBaseRecordWriter method goes through the
           following transformation steps:

          
                   final String myName = generateLeafFileName(name);
                   String keyBasedPath = generateFileNameForKeyValue(key, value, myName);
                   // get the file name based on the input file name
                   String finalPath = getInputFileBasedOutputFileName(myJob, keyBasedPath);
          

           The methods generateLeafFileName, generateFileNameForKeyValue and getInputFileBasedOutputFileName are
           all protected. Subclasses can implement specific file name transformation logic, based on the key/value or on
           the original input file name. The default implementations do not transform the file name.
           Thus, by default, the value passed to the name argument of getBaseRecordWriter IS the original leafName.
           Within the implementation, you still have the opportunity to perform file name transformation if that is deemed necessary.

          Hope this helps.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12379227/patch3149.txt
          against trunk revision 643282.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac -1. The applied patch generated 511 javac compiler warnings (more than the trunk's current 510 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2139/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2139/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2139/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2139/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          Runping, I'm confused why there are three steps to generating the name. It seems really confusing. Why not have:

          String finalPath = generateLeafFileName(name, key, job);
          

          possibly with the value, if it really makes sense. Of course, that would be a separate bug...

          Owen O'Malley added a comment -

          I think this patch is still too big and complex for its goal (a simpler and friendlier interface to the multiple output format).

          I don't understand your objection to the KeyValue class. It is designed precisely so that you can emit different key and value types to each of the base output formats from your reducer (or mapper, if there is no reduce). Using a wrapper class makes the polymorphism easy and is cheap in terms of performance.
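          For reference, a wrapper along these lines (the name and shape below are hypothetical; the actual class is defined in the patch) is essentially just a typed pair:

```java
// Hypothetical sketch of the KeyValue wrapper being discussed: a typed
// pair that lets a reducer emit different key/value types per base output.
class KeyValueSketch<K, V> {
    final K key;
    final V value;

    KeyValueSketch(K key, V value) {
        this.key = key;
        this.value = value;
    }
}
```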

          I do agree that the getBaseRecordWriter should get the virtual key (aka leaf name).

          Runping Qi added a comment -

          bq: possibly with the value, if it really makes sense. Of course, that would be a separate bug...

          Yes, it makes sense to combine the first two steps. Please open a new JIRA and we can address it there.

          Alejandro Abdelnur added a comment -

          My concern with the KeyValue class and how it would be used is that it is not an intuitive use of the OutputCollector. The expectation is that the collect(key, value) method takes a key and a value; with the use of KeyValue, the key is the named output and the value is the key/value to output. IMO this is awkward.

          While the MultipleOutputFormat provides a powerful mechanism to use the key/value to redirect the output in very clever ways, I still think there should be a simple API to just write key/values to different outputs. And, as you suggested, it could be based on the MultipleOutputFormat.

          Regarding the finalPath generation and the 2 steps, I agree that how things are done is confusing. Still, I think the 2 steps are needed: one to generate the filename without partition info and a second to add the partition info (both to avoid name collisions). Probably the current method names are not right, hence the confusion they generate.

          Also, in getRecordWriter(), the anonymous RecordWriter being returned keys its TreeMap of record writers off the finalPath of the file; shouldn't it be keyed off the keyBasedPath?

          Nigel Daley added a comment -

          This feature missed 0.17 feature freeze.

          Alejandro Abdelnur added a comment -

          Removed dependency on the MultipleOutputFormat.

          Code became simpler as MultipleOutputs addresses specifically the case of defining a fixed set of additional outputs during the job configuration.

          The patch, as before, allows the different outputs to have OutputFormat and key/value classes that differ from the job's output configuration and from one another.

          Included sample usage in the Javadoc.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380141/patch3149.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2235/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2235/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2235/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2235/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          fixing javadoc warning

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380157/patch3149.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2238/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2238/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2238/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2238/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          This patch adds a property with the name of the current namedOutput to the JobConf used to create the RecordWriter.

          Using this property, custom OutputFormats can obtain the name of the namedOutput being created and do any named-output-specific configuration.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380363/patch3149.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc -1. The javadoc tool appears to have generated 1 warning messages.

          javac -1. The applied patch generated 483 javac compiler warnings (more than the trunk's current 482 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2260/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2260/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2260/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2260/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          This is getting close, but you still have 8 javadoc warnings:

           [javadoc] Building tree for all the packages and classes...
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: missing '#': "addNamedOutput()"
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: can't find addNamedOutput() in org.apache.hadoop.mapred.lib.MultipleOutputs
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputMapper
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputReducer
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputMapper
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputReducer
            [javadoc] Building index for all the packages and classes...
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputMapper
            [javadoc] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/java/org/apache/hadoop/mapred/lib/MultipleOutputs.java:99: warning - Tag @link: reference not found: MultipleOutputReducer
          

          that you need to fix.

          You should also remove the references to Writable and WritableComparable and make them Object instead.

          I'm still not wild about having 2 completely separate interfaces for multiple outputs in the library, but this one is easier to use, if less flexible...

          Alejandro Abdelnur added a comment -

          fixing javadocs.

          >> You should also remove the references to Writable and WritableComparable and make them Object instead.

          Where? Do you mean in the MultipleOutputs.collect() method or in the static methods for configuration?

          Alejandro Abdelnur added a comment -

          Fixing javadocs.

          Regarding the removal of Writable and WritableComparable replacing them with Object:

          If I understand things correctly, all APIs used directly from mapper/reducer code should be typed or generified to enforce as much type safety as possible at compile time; internal APIs are more lax. If my assumption is correct, then this API should use Writable/WritableComparable and generics.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380891/patch3149.txt
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 3 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac -1. The applied patch generated 483 javac compiler warnings (more than the trunk's current 482 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2323/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2323/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2323/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2323/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          taking care of javac warning (in TestCase)

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381415/patch3149.txt
          against trunk revision 653264.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2395/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2395/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2395/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2395/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment - edited

          Regarding Owen's comment, "You should also remove the references to Writable and WritableComparable and make them Object instead":

          Assuming that it refers to generified Objects, I still think it is not applicable to remove the use of WritableComparable and Writable from MultipleOutputs.collect().

          The K and V classes used in the different outputs of a MultipleOutputs may differ among the outputs, and they may differ from the K and V classes used by the M/R job.

          Because of this they cannot be bound to the K and V generics of the Mapper and Reducer classes; thus they have to accept the base class.

          For example:

            JobConf conf = new JobConf();
          
            conf.setMapperClass(MOMap.class);
            conf.setReducerClass(MOReduce.class);
          
            conf.setInputPath(inDir);
            conf.setInputFormat(TextInputFormat.class);
          
            conf.setMapOutputKeyClass(LongWritable.class);
            conf.setMapOutputValueClass(Text.class);
          
            FileOutputFormat.setOutputPath(conf, outDir);
          
            conf.setOutputFormat(TextOutputFormat.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
          
            MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class, LongWritable.class, Text.class);
            MultipleOutputs.addNamedOutput(conf, "sequence", SequenceFileOutputFormat.class, Text.class, MapWritable.class);
            ...
          
          public class MOReduce implements Reducer<LongWritable, Text, Text, Text> {
            private MultipleOutputs mos;
          
            public void configure(JobConf conf) {
              mos = new MultipleOutputs(conf);
            }
          
            public void reduce(LongWritable key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
              Text value = values.next();
              output.collect(new Text("a"), value);
              mos.collect("text", key, new Text("Hello"), reporter);
              mos.collect("sequence", value, new MapWritable(), reporter);
            }
          
            public void close() throws IOException {
              mos.close();
            }
          
          }
          
          Alejandro Abdelnur added a comment -

          How about the following API change in the MultipleOutputs?

          Instead of the void collect(String, WritableComparable, Writable) method, have an OutputCollector getCollector(String namedOutput, Reporter reporter) method.

          This would bring consistency to the API used to write K/V pairs out, and it would allow reuse of code that accepts OutputCollectors to write K/V pairs to named outputs. For example:

          public void base64AndWrite(WritableComparable key, Writable value,OutputCollector<Text,Text> collector);

          Then this method could be used with the M/R OutputCollectors and reused with MultipleOutputs.
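
          The reuse argument above can be sketched in a self-contained way. This is a hypothetical illustration: OutputCollector here is a local stand-in for Hadoop's interface, and the two collectors stand in for the framework-supplied one and one obtained via the proposed getCollector(namedOutput, reporter).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CollectorSketch {
  // Local stand-in for org.apache.hadoop.mapred.OutputCollector.
  interface OutputCollector<K, V> {
    void collect(K key, V value) throws IOException;
  }

  // A helper that only knows about OutputCollector. With a
  // getCollector(namedOutput, reporter) style API, the same helper can
  // serve the job's main output and any named output.
  static void emitUpperCased(String key, String value,
                             OutputCollector<String, String> out)
      throws IOException {
    out.collect(key, value.toUpperCase());
  }

  public static void main(String[] args) throws IOException {
    List<String> mainOutput = new ArrayList<>();
    List<String> namedOutput = new ArrayList<>();

    // Pretend these came from the framework and from getCollector("text", ...).
    OutputCollector<String, String> main = (k, v) -> mainOutput.add(k + "\t" + v);
    OutputCollector<String, String> named = (k, v) -> namedOutput.add(k + "\t" + v);

    emitUpperCased("a", "hello", main);   // reused for the main output
    emitUpperCased("b", "world", named);  // reused for a named output
    System.out.println(mainOutput + " " + namedOutput);
  }
}
```

          With a collect(String, K, V) method instead, the helper would need a MultipleOutputs-specific overload.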

          Owen O'Malley added a comment -

          No, as of 0.17 (via HADOOP-1986) the OutputCollector interface is now:

          public interface OutputCollector<K,V> {
            void collect(K key, V value) throws IOException;
          }
          

          instead of:

          public interface OutputCollector<K extends WritableComparable, V extends Writable> {
            void collect(K key, V value) throws IOException;
          }
          

          So your interface should be compatible and take Objects instead of Writables...
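
          The compatibility point can be shown with a minimal, hypothetical sketch; the interfaces below are local stand-ins for Hadoop's, not the real classes. Dropping the bounds keeps Writable-based callers compiling while also admitting plain Objects:

```java
import java.util.ArrayList;
import java.util.List;

public class GenericsSketch {
  // Stand-ins for Hadoop's serialization marker interfaces.
  interface Writable {}
  interface WritableComparable extends Writable {}

  // The relaxed 0.17-style interface: no bounds on K and V.
  interface OutputCollector<K, V> {
    void collect(K key, V value);
  }

  static class Text implements WritableComparable {
    final String s;
    Text(String s) { this.s = s; }
  }

  static final List<String> sink = new ArrayList<>();

  public static void main(String[] args) {
    // Callers written against the old bounded interface still compile
    // against the unbounded one...
    OutputCollector<Text, Text> writableCollector =
        (k, v) -> sink.add(k.s + "=" + v.s);
    writableCollector.collect(new Text("k"), new Text("v"));

    // ...and plain Objects are now legal keys and values too.
    OutputCollector<String, Long> plainCollector =
        (k, v) -> sink.add(k + "=" + v);
    plainCollector.collect("n", 42L);

    System.out.println(sink); // [k=v, n=42]
  }
}
```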

          Alejandro Abdelnur added a comment -

          Finally understood (I was not aware of HADOOP-1986).

          Removing use of WritableComparable and Writable.

          The only remaining question is the refactoring to return an OutputCollector instead of having the collect() method (see my previous comment on that).

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12381855/patch3149.txt
          against trunk revision 655410.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2450/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2450/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2450/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2450/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          Alejandro,
          Sorry it took so long to get back to this one! I'm still troubled about having two classes in mapred.lib that do almost exactly the same thing. What would be the changes to Runping's interface that would make you happy with it?

          On a more pragmatic note, you don't need the InternalFileOutputFormat. Just put it into MultipleOutputs.

          It also bothered me to see you getting the default file system, but it is fine because you are just feeding it to the output format, which shouldn't be using the file system parameter anyway.

          Alejandro Abdelnur added a comment -

          For our requirements, the MultipleOutputFormat shortcomings (which the proposed patch addresses) are:

          • It does not support K & V of different types for the different output names.
          • A Map/Reduce job cannot use it from both Map and Reduce.
          • If used from a Map the job cannot have a Reduce.
          • It is an abstract class and it requires different implementations for different OutputFormats.

          From the M/R developer usage perspective:

          • Looking at the M/R code, it is not obvious that data is being written to a different output; it falls outside the M/R I/O flow.
          • It is not clear to which output the data is being written unless there is a clear understanding of the MOF implementation.

          IMO, the current MOF has great flexibility, but that flexibility comes at a complexity cost in usage and understanding.

          On your comment about not needing the InternalFileOutputFormat, you are right; I could make MultipleOutputs extend FileOutputFormat and have the getRecordWriter() method there.

          On your comment about using the default file system, I did that because the OutputFormat interface indicates it is ignored, and because the file to be created goes in the same place as the job output, thus using the job conf for it.

          Alejandro Abdelnur added a comment -

          Incorporated HADOOP-3258 in the patch, as that issue was addressing a need of this one.

          Added support for multi named outputs: being able to create multiple distinct outputs using the same OutputFormat, key class, and value class, in a dynamic way from the job.

          Refactored to use the OutputCollector interface for writing to the named outputs.
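
          The multi named output idea (one configured named output, many dynamically chosen files) can be illustrated with a toy in-memory mock; this is not Hadoop's MultipleOutputs class, and the method and field names here are illustrative only:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiNamedSketch {
  interface OutputCollector<K, V> {
    void collect(K key, V value) throws IOException;
  }

  // Toy stand-in for MultipleOutputs: one sink per (namedOutput, part) pair,
  // all sharing the named output's configured format and K/V classes.
  static class MockMultipleOutputs {
    final Map<String, Map<Object, Object>> outputs = new LinkedHashMap<>();

    OutputCollector<Object, Object> getCollector(String name, String part) {
      String file = name + "_" + part;   // each part becomes a distinct file
      Map<Object, Object> sink =
          outputs.computeIfAbsent(file, f -> new HashMap<>());
      return sink::put;
    }
  }

  public static void main(String[] args) throws IOException {
    MockMultipleOutputs mos = new MockMultipleOutputs();
    // Same named output, two parts chosen at runtime -> two distinct files.
    mos.getCollector("status", "NEW").collect("id1", "row1");
    mos.getCollector("status", "DELETE").collect("id2", "row2");
    System.out.println(mos.outputs.keySet()); // [status_NEW, status_DELETE]
  }
}
```

          This matches the classification use case from the description: NEW, UPDATE, DELETE, and NO-CHANGE records land in separate files without configuring four separate named outputs.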

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12384642/patch3149.txt
          against trunk revision 671385.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 447 javac compiler warnings (more than the trunk's current 442 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2730/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2730/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2730/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2730/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          Fixing the MultipleOutputs javadoc to avoid the warning.

          The warnings on FileOutputFormat are in areas of the file not affected by this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12384653/patch3149.txt
          against trunk revision 671385.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 447 javac compiler warnings (more than the trunk's current 442 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2732/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2732/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2732/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2732/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          On the findbugs report, the use of a static variable is on purpose here. It is to allow the use of the same outputs from different components running in a Map or Reduce without having to pass the MultipleOutputs instance as a parameter. It leverages the fact that tasks run in their own JVM.

          The core-test error does not seem related.

          Alejandro Abdelnur added a comment -

          The patch has a bug: multi named outputs are not created properly.

          It also uses a static variable for the record writers, which is not correct (this was to handle a special case, but it may lead to errors).

          Alejandro Abdelnur added a comment -

          Fixes file creation and handling for multi named outputs.

          Makes the map of record writers an instance variable.

          Uses a set for the named output names, making the code cleaner by not overloading the use of the record writers map.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12384738/patch3149.txt
          against trunk revision 671563.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          -1 javac. The applied patch generated 449 javac compiler warnings (more than the trunk's current 442 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2746/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2746/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2746/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2746/console

          This message is automatically generated.

          Owen O'Malley added a comment -

          This looks ready to go other than the javac and javadoc warnings. Please fix them. Sorry this has been such a long process on this feature!

          Runping Qi added a comment -

          Just a reflection, or a forward-looking thought: would it be better to change the reduce contract to:

          public void reduce(Object key, Iterator<Text> values, MultipleOutputs mos,
                             Reporter reporter) throws IOException {
              mos.collect("default", new Text("a"), value, reporter);
              mos.collect("text", key, new Text("Hello"), reporter);
              mos.collect("sequence", value, new MapWritable(), reporter);
            }

          Hope the future contract-based API will take care of this.

          Alejandro Abdelnur added a comment -

          fixing javadoc/javac warnings

          Alejandro Abdelnur added a comment -

          The MultipleOutputs getNamedOutputs() method is protected; it should be public.

          Alejandro Abdelnur added a comment -

          Making the MultipleOutputs getNamedOutputs() method public.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385371/patch3149.txt
          against trunk revision 674592.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 448 javac compiler warnings (more than the trunk's current 442 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2803/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2803/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2803/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2803/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          Trying once more to see if it passes the javac warning check; the only warnings I see are in the test cases.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12385462/patch3149.txt
          against trunk revision 674932.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2817/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2817/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2817/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2817/console

          This message is automatically generated.

          Devaraj Das added a comment -

          I just committed this. Thanks Alejandro for being so patient on this one!

          Tsz Wo Nicholas Sze added a comment -

          Created HADOOP-3836 for the problems in TestMultipleOutputs. I am not sure how to fix it. Could you take a look?

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

            People

            • Assignee: Alejandro Abdelnur
            • Reporter: Alejandro Abdelnur
            • Votes: 0
            • Watchers: 3
