Hadoop Common / HADOOP-2906

Output format classes that can write to different files depending on keys and/or config variables

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None

      Description

      I have a few apps that need to write data into different files/directories depending on keys and/or configuration variables.
      I've implemented such classes for those apps, and I've noticed that many other users have a similar need from time to time.
      So I think it may be a good idea to contribute them to the Hadoop mapred.lib package so that other users can benefit.

        Activity

        Owen O'Malley made changes -
        Component/s mapred [ 12310690 ]
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment - Integrated in Hadoop-trunk #421 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/421/)
        Chris Douglas made changes -
        Fix Version/s 0.17.0 [ 12312913 ]
        Resolution Fixed [ 1 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Chris Douglas added a comment -

        I just committed this. Thanks, Runping!

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12377175/patch.2096.6.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1897/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1897/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1897/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1897/console

        This message is automatically generated.

        Chris Douglas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Chris Douglas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Runping Qi made changes -
        Attachment patch.2096.6.txt [ 12377175 ]
        Runping Qi added a comment -

        Replaced the attribute name "num.of.trailing.legs.to.use" with "mapred.outputformat.numOfTrailingLegs".

        Addressed the case where the number specified by the above variable is larger than the number of legs
        in the input file path.
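
A minimal sketch of the trailing-legs idea, assuming the value of "mapred.outputformat.numOfTrailingLegs" is passed in as n. The class and method names (TrailingLegs, lastLegs) are hypothetical, and falling back to the whole path when n is out of range is an assumption for illustration; the comment above does not say how the committed patch handles that case.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: keeps the last n legs of a path.
final class TrailingLegs {
    private TrailingLegs() {}

    /**
     * Returns the last n slash-separated legs of path.
     * Assumption: if n is not positive or exceeds the number of legs,
     * all legs are kept instead of throwing.
     */
    static String lastLegs(String path, int n) {
        // Split on '/', dropping empty legs produced by leading or doubled slashes.
        String[] legs = Arrays.stream(path.split("/"))
                .filter(leg -> !leg.isEmpty())
                .toArray(String[]::new);
        int keep = (n <= 0 || n > legs.length) ? legs.length : n;
        return String.join("/", Arrays.copyOfRange(legs, legs.length - keep, legs.length));
    }
}
```

With this sketch, a request for more legs than the path contains returns every leg rather than failing inside path construction.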

        Runping Qi made changes -
        Attachment patch.2096.5.txt [ 12377107 ]
        Martin Traverso added a comment -

        I would suggest changing the name of the property from "num.of.trailing.legs.to.use" to something that reflects the hierarchy in which the property lives. Maybe something like mapred.output.format.multi.trailingLegs or similar.

        Chris Douglas added a comment -

        A couple suggestions:

        • If "num.of.trailing.legs.to.use" exceeds the number of segments in the input file path string, then this will throw an IllegalArgumentException from Path. A more helpful message should probably accompany this condition.
        • It might be worth calling out in the javadocs that generateActualKey and generateActualValue should be aware of side-effects, since write typically doesn't modify its args and the framework will reuse them. The code is clear enough that users can educate themselves, but this is deserving of a footnote.

        Otherwise, +1
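
The side-effect concern in the second suggestion can be illustrated with a small sketch. MutableKey is a hypothetical stand-in for a mutable Writable key that the framework reuses between calls, and stripInPlace / stripCopy are hypothetical variants of a generateActualKey-style hook; none of these names come from the patch.

```java
// Hypothetical stand-in for a mutable key object that a caller reuses.
final class MutableKey {
    String text;
    MutableKey(String text) { this.text = text; }
}

final class ActualKeyDemo {
    private ActualKeyDemo() {}

    // Side-effecting variant: strips the "dir/" prefix by modifying the argument,
    // so the caller's object changes too.
    static MutableKey stripInPlace(MutableKey key) {
        key.text = key.text.substring(key.text.indexOf('/') + 1);
        return key;
    }

    // Side-effect-free variant: leaves the argument untouched and returns a new object.
    static MutableKey stripCopy(MutableKey key) {
        return new MutableKey(key.text.substring(key.text.indexOf('/') + 1));
    }
}
```

After stripCopy the caller still sees the original key text; after stripInPlace it does not, which matters if the same object is inspected or reused after the write.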

        Runping Qi made changes -
        Attachment patch.2096.5.txt [ 12377107 ]
        Runping Qi added a comment -

        The previously attached patch was wrong.
        Attaching the correct version now.

        Runping Qi made changes -
        Attachment patch.2096.4 [ 12377027 ]
        Runping Qi made changes -
        Attachment patch.2096.4 [ 12377027 ]
        Runping Qi made changes -
        Attachment patch.2096.3.txt [ 12376785 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376785/patch.2096.3.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests -1. The patch failed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1872/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1872/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1872/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1872/console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376775/patch.2096.2.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new javac compiler warnings.

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests -1. The patch failed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1869/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1869/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1869/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1869/console

        This message is automatically generated.

        Runping Qi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Runping Qi made changes -
        Attachment patch.2096.2.txt [ 12376775 ]
        Runping Qi made changes -
        Attachment patch.2096.3.txt [ 12376785 ]
        Runping Qi added a comment -

        Incorporated some of the feedback.

        Runping Qi made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Runping Qi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Runping Qi added a comment -

        There was a javac warning in the test class.
        The new patch fixes it.

        Runping Qi made changes -
        Attachment patch.2096.1.txt [ 12376745 ]
        Runping Qi made changes -
        Attachment patch.2096.2.txt [ 12376775 ]
        Runping Qi made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376745/patch.2096.1.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 616 javac compiler warnings (more than the trunk's current 615 warnings).

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests -1. The patch failed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1862/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1862/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1862/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1862/console

        This message is automatically generated.

        Runping Qi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Runping Qi added a comment -

        Finally managed to get rid of the javac warning.

        Runping Qi made changes -
        Attachment patch.2096.txt [ 12376696 ]
        Runping Qi made changes -
        Attachment patch.2096.1.txt [ 12376745 ]
        Runping Qi made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Runping Qi added a comment -

        I think the extra javac warning comes from the following code, which carries a
        @SuppressWarnings("unchecked") directive:

              @SuppressWarnings("unchecked")
              public void write(WritableComparable key, Writable value) throws IOException {

                // get the file name based on the key
                String keyBasedPath = generateFileNameForKey(key, myName);

                // get the file name based on the input file name
                String finalPath = getInputFileBasedOutputFileName(myJob, keyBasedPath);

                // get the actual key
                WritableComparable actualKey = generateActualKey(key);

                RecordWriter rw = this.recordWriters.get(finalPath);
                if (rw == null) {
                  // if we don't have the record writer yet for the final path, create one
                  // and add it to the cache
                  rw = getRecordWriter_inner(myFS, myJob, finalPath, myProgressable);
                  this.recordWriters.put(finalPath, rw);
                }
                rw.write(actualKey, value);
              }
        

        javac warns about
        rw.write(actualKey, value)
        because rw is of the raw RecordWriter type, not a parameterized one.
        It has to be raw because rw may be a record writer generated by SequenceFileOutputFormat,
        which does not produce a parameterized RecordWriter. I tried a few ways to get rid of the warning, but all failed.
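
The kind of warning described above can be reproduced without Hadoop. In this sketch PairWriter is a hypothetical stand-in for RecordWriter, and legacyWriter stands in for a factory that returns the raw type. With a recent JDK and -Xlint:unchecked, calling write through the raw type is reported as an unchecked call unless the enclosing method is annotated; how the compiler used by the QA build at the time treated the annotation is not something this sketch claims to show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic record writer interface.
interface PairWriter<K, V> {
    void write(K key, V value);
}

final class RawWriterDemo {
    private RawWriterDemo() {}

    // Records what was written, so the effect of the call can be observed.
    static final List<String> LOG = new ArrayList<>();

    // Returns the raw type, like a factory whose signature is not parameterized.
    @SuppressWarnings("rawtypes")
    static PairWriter legacyWriter() {
        return new PairWriter<String, String>() {
            @Override
            public void write(String key, String value) {
                LOG.add(key + "=" + value);
            }
        };
    }

    // rw.write(...) goes through the raw type, which is an unchecked call;
    // the annotation scopes the suppression to this one method.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static void writeThroughRaw(String key, String value) {
        PairWriter rw = legacyWriter();
        rw.write(key, value);
    }
}
```

Keeping the suppression on the smallest method that needs it leaves the rest of the class covered by the compiler's type checks.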

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376696/patch.2096.txt
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 3 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 620 javac compiler warnings (more than the trunk's current 619 warnings).

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1859/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1859/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1859/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1859/console

        This message is automatically generated.

        Runping Qi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Runping Qi added a comment -

        The attached patch includes a common abstract base class (MultipleOutputFormat) and two concrete classes:
        MultipleTextOutputFormat and MultipleSequenceFileOutputFormat. These classes implement the default behaviors,
        which are the same as those of the TextOutputFormat and SequenceFileOutputFormat classes, respectively.
        Users can subclass these classes and override one of the protected methods to implement specific logic
        for writing data to different output files.
        The patch also contains a test case, which illustrates two special ways of using these classes.
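
The subclass-and-override pattern described above can be sketched without a Hadoop dependency. MultiSink and ByPrefixSink are hypothetical stand-ins that write to in-memory buffers instead of files; only the hook names generateFileNameForKey and generateActualKey are taken from this issue, and their signatures here are simplified assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical base class: routes each record to a named in-memory "file".
class MultiSink {
    // One buffer per output name, created lazily and cached.
    private final Map<String, StringBuilder> files = new TreeMap<>();
    private final String defaultName;

    MultiSink(String defaultName) { this.defaultName = defaultName; }

    // Default behavior: every record goes to the single default file.
    protected String generateFileNameForKey(String key, String name) { return name; }

    // Default behavior: the key is written unchanged.
    protected String generateActualKey(String key) { return key; }

    final void write(String key, String value) {
        String fileName = generateFileNameForKey(key, defaultName);
        files.computeIfAbsent(fileName, n -> new StringBuilder())
             .append(generateActualKey(key)).append('\t').append(value).append('\n');
    }

    final String contents(String fileName) {
        StringBuilder sb = files.get(fileName);
        return sb == null ? "" : sb.toString();
    }
}

// Hypothetical subclass: a key such as "fruit:apple" goes to "fruit/<default name>".
class ByPrefixSink extends MultiSink {
    ByPrefixSink(String defaultName) { super(defaultName); }

    @Override
    protected String generateFileNameForKey(String key, String name) {
        int colon = key.indexOf(':');
        return colon < 0 ? name : key.substring(0, colon) + "/" + name;
    }

    @Override
    protected String generateActualKey(String key) {
        // Drops the routing prefix; a key without ':' is returned unchanged.
        return key.substring(key.indexOf(':') + 1);
    }
}
```

A ByPrefixSink created with the default name "part-00000" and given the keys "fruit:apple" and "veg:kale" ends up with two buffers, "fruit/part-00000" and "veg/part-00000", while a plain MultiSink would put both records in "part-00000".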

        Runping Qi made changes -
        Attachment patch.2096.txt [ 12376696 ]
        Runping Qi made changes -
        Assignee Runping Qi [ runping ]
        Runping Qi made changes -
        Field Original Value New Value
        Description
        I've a few apps that require to write out data into different files/directories depending on keys and/or configuration variables.
        I've implemented such classes for those apps. I noticed that many other users have similar need from time to time.
        So I think it may be a good idea to contribute to Hadoop mapred.lib package so that other users can benefit from it.
        Component/s mapred [ 12310690 ]
        Runping Qi created issue -

          People

          • Assignee:
            Runping Qi
            Reporter:
            Runping Qi
          • Votes:
            0
            Watchers:
            1
