Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: Ports MultipleOutputs to the new Map/Reduce API
    1. patch-370-5.txt
      29 kB
      Amareshwari Sriramadasu
    2. patch-370-4.txt
      28 kB
      Amareshwari Sriramadasu
    3. patch-370-3.txt
      35 kB
      Amareshwari Sriramadasu
    4. patch-370-2.txt
      33 kB
      Amareshwari Sriramadasu
    5. patch-370-1.txt
      35 kB
      Amareshwari Sriramadasu
    6. patch-370.txt
      34 kB
      Amareshwari Sriramadasu

      Activity

      Hudson added a comment -

      Integrated in Hadoop-Mapreduce-trunk #75 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/75/)

      Hudson added a comment -

      Integrated in Hadoop-Mapreduce-trunk-Commit #17 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/17/)
      Update MultipleOutputs to use the API, merge functionality
      of MultipleOutputFormat. Contributed by Amareshwari Sriramadasu

      Chris Douglas added a comment -

      +1

      The test failure, TestNodeRefresh.testMRExcludeHostsAcrossRestarts, is not related.

      I committed this. Thanks, Amareshwari!

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12418584/patch-370-5.txt
      against trunk revision 811134.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/40/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/40/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/40/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/40/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Re-submitting for Hudson.

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12418357/patch-370-4.txt
      against trunk revision 811134.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/5/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/5/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/5/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Attaching the correct patch.

      Amareshwari Sriramadasu added a comment -

      Made get/setOutputName protected methods.

      Chris Douglas added a comment -

      The patch looks good, but package-private access seems too restrictive for FileOutputFormat::get/setOutputName. Both should probably be public and final, or at least protected.

      Amareshwari Sriramadasu added a comment -

      The -1 on core tests is due to MAPREDUCE-943.

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12418357/patch-370-4.txt
      against trunk revision 810330.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/34/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/34/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/34/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/34/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Chris suggested passing the filename in the Configuration, instead of adding the method and making the change incompatible. That works like a charm.
      The attached patch removes the earlier changes from FileOutputFormat and its subclasses, and modifies the getDefaultWorkFile method to use the name specified in the conf. If nothing is specified, "part" is used.
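The configuration-based approach described in this comment can be sketched in plain Java. Note this is an illustrative stand-in, not the committed code: a plain Map stands in for Hadoop's Configuration, and the property key below is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a plain Map stands in for Hadoop's Configuration,
// and the property key below is an assumption, not necessarily the one the
// committed patch uses.
public class BaseNameSketch {
    static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";

    // Mimics the getDefaultWorkFile() idea: read the base name from the conf,
    // defaulting to "part", then append the {m,r}-nnnnn task suffix.
    static String defaultWorkFileName(Map<String, String> conf,
                                      String taskType, int partition) {
        String base = conf.getOrDefault(BASE_OUTPUT_NAME, "part");
        return String.format("%s-%s-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(defaultWorkFileName(conf, "r", 0)); // part-r-00000
        conf.put(BASE_OUTPUT_NAME, "text");
        System.out.println(defaultWorkFileName(conf, "m", 3)); // text-m-00003
    }
}
```

Because the name travels through the configuration rather than a new method parameter, existing FileOutputFormat subclasses keep their signatures, which is what makes the change compatible.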

      Hudson added a comment -

      Integrated in Hadoop-Mapreduce-trunk-Commit #3 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/3/)
      Revert

      Amareshwari Sriramadasu added a comment -

      Regarding the -1 on core tests: the TestRecoveryManager failure is due to MAPREDUCE-880, and TestPipes is due to MAPREDUCE-924.
      TestSeveral crashed on Hudson, but it passed on my local machine.

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12417954/patch-370-3.txt
      against trunk revision 808730.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/533/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/533/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/533/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/533/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      The earlier patch added one more parameter to FileOutputFormat.getDefaultWorkFile() and SequenceFileOutputFormat.getSequenceWriter().
      The attached patch restores the earlier methods with a default value for the new parameter and deprecates them.

      Chris Douglas added a comment -

      I reverted this

      Arun C Murthy added a comment -

      Hmmm... since this went in just now and is an incompatible change, I'd propose we revert it and fix it correctly.

      Chris Douglas added a comment -

      This is an incompatible change. It removes a public API, FileOutputFormat::getDefaultWorkFile(TaskAttemptContext, String). The javadoc for the new method is also wrong. It also removes a protected API in SequenceFileOutputFormat.

      Are some of the changes enhancements to MultipleOutputs beyond upgrading it to use the new API? If so, they should be part of a separate issue.

      Sharad Agarwal added a comment -

      I just committed this. Thanks Amareshwari!

      Sharad Agarwal added a comment -

      +1

      Amareshwari Sriramadasu added a comment -

      The -1 on core tests is due to the TestRecoveryManager failure (MAPREDUCE-880).

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12417469/patch-370-2.txt
      against trunk revision 807123.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/508/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/508/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/508/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/508/console

      This message is automatically generated.

      Tom White added a comment -

      I think these users can override FileOutputFormat.getDefaultWorkFile to control the precise filename.

      This is true. So to have complete control over the output filename, you would call the write method with a base output path of the name you want (possibly using the key and value to construct it). You would then override FileOutputFormat.getDefaultWorkFile() to omit the {m,r}-nnnnn suffix.

      We could perhaps make this slightly easier in the future (by putting it in the MultipleOutputs API, for example), but I think the current approach is reasonable.
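The overriding trick described here can be illustrated with a toy class hierarchy. These classes are simplified, hypothetical stand-ins: the real override would be on org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, whose getDefaultWorkFile() takes a TaskAttemptContext and returns a Path.

```java
// Toy stand-ins for the real classes; names and signatures are simplified
// for illustration only.
class SuffixingOutputFormat {
    // Default behavior: append the {m,r}-nnnnn task suffix to the base path.
    String getDefaultWorkFile(String baseOutputPath, String taskType, int part) {
        return String.format("%s-%s-%05d", baseOutputPath, taskType, part);
    }
}

class ExactNameOutputFormat extends SuffixingOutputFormat {
    // Override to omit the suffix entirely, giving the application complete
    // control over the output filename (it must then avoid name clashes).
    @Override
    String getDefaultWorkFile(String baseOutputPath, String taskType, int part) {
        return baseOutputPath;
    }
}
```

With the default format a base path of "logs" becomes something like "logs-r-00001"; the overriding subclass writes to "logs" verbatim.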

      Amareshwari Sriramadasu added a comment -

      Patch changing checkTokenName() and checkBaseOutputPath() to be private.

      Amareshwari Sriramadasu added a comment -

      Could the counter name be based on the named output, rather than the base filename?

      Possible. But counters will be maintained only for named outputs.

      but this change I'm proposing would allow advanced users to control the precise filename of the outputs.

      I think these users can override FileOutputFormat.getDefaultWorkFile to control the precise filename.

      Tom White added a comment -

      Could the counter name be based on the named output, rather than the base filename?

      if user doesn't give unique name for the output file, there are chances that output will be garbled.

      This is true, but like MultipleOutputFormat it would be up to the application to give unique names to the output files. Most users would use the simpler form that takes a named output and lets MultipleOutputs construct the output filename ({namedOutput}-(m|r)-{part-number}), but this change I'm proposing would allow advanced users to control the precise filename of the outputs.

      Amareshwari Sriramadasu added a comment -

      I think that there should be the ability to have complete control over the output filename, much as MultipleOutputFormat does. To achieve this we could change the baseOutputPath parameter in the write methods to be a full output path. The user application would be responsible for making sure there are no name clashes - this is like the functionality available in MultipleOutputFormat today. The overloaded version is available if the user doesn't care so much about the output filenames, which will then have a {m,r}-nnnnn suffix. Does this make sense?

      Tom, I did not do this, because MultipleOutputs has a feature for maintaining counters, which count the number of records written to each output name. If we take the full output name from the user, aggregating these counters at the job level is not straightforward. Also, if the user doesn't give a unique name for the output file, there are chances that the output will be garbled. So I thought taking the baseOutputName (which is also the counter name) from the user and having the framework construct the full output filename would be the right solution. Don't you think this is right?

      I think that there should be the ability to have complete control over the output filename, much as MultipleOutputFormat does.

      With the current patch, the user has complete control over the path. Just that whatever path is chosen, the file name is <baseOutputPath>-{m,r}-<part-number>.

      Tom White added a comment -

      This is looking good.

      I think that there should be the ability to have complete control over the output filename, much as MultipleOutputFormat does. To achieve this we could change the baseOutputPath parameter in the write methods to be a full output path. The user application would be responsible for making sure there are no name clashes - this is like the functionality available in MultipleOutputFormat today. The overloaded version is available if the user doesn't care so much about the output filenames, which will then have a {m,r}-nnnnn suffix. Does this make sense?

      Also, we should find a way of not exposing MultipleOutputs#checkBaseOutputPath() and checkTokenName() as public methods since they are only needed internally by the framework.

      Amareshwari Sriramadasu added a comment -

      -1 core tests: due to MAPREDUCE-880
      -1 contrib tests: due to MAPREDUCE-699

      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12417117/patch-370-1.txt
      against trunk revision 806065.

      +1 @author. The patch does not contain any @author tags.

      +1 tests included. The patch appears to include 3 new or modified tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      -1 contrib tests. The patch failed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/497/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/497/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/497/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/497/console

      This message is automatically generated.

      Amareshwari Sriramadasu added a comment -

      Patch with review comments from Sharad incorporated.

      Sharad Agarwal added a comment -

If we do this, it will remove the generate* methods from the proposed API.

I am +1 on removing generate*. This will simplify the API, and users who want to control the output path name will no longer need to override MultipleOutputs.

      Amareshwari Sriramadasu added a comment -

To achieve this, I think we could port MultipleOutputs, and change the semantics of getCollector() in the multi name case, so that the multi name is the full name of the output file. This method is typically invoked in the reduce() method, where the key and value are available, and can be used to form the name.

If we do this, it will remove the generate* methods from the proposed API, and the API for writing would look like:

      public <K,V> void write(String namedOutput,  K key, V value, String outputPath)
                throws IOException, InterruptedException;
      public <K,V> void write(String namedOutput,  K key, V value)
                throws IOException, InterruptedException;
      public <K,V> void write( K key, V value, String outputPath)
                throws IOException, InterruptedException;
      

Let me know if this looks fine.
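As a purely illustrative sketch of the third overload above — write(key, value, outputPath) — a reducer could derive the full output path from the key. The "category:name" key convention and the helper below are hypothetical, not part of the patch:

```java
// Hypothetical helper: map a record key of the form "category:name"
// to a full output path. A '/' in the returned name would place the
// file in a per-category subdirectory.
public class OutputPathSketch {
    static String outputPathFor(String key) {
        int sep = key.indexOf(':');
        if (sep < 0) {
            return key;                  // no category: flat file name
        }
        // e.g. "en:hello" -> "en/hello"
        return key.substring(0, sep) + "/" + key.substring(sep + 1);
    }
}
```

A reducer would then call something like write(key, value, OutputPathSketch.outputPathFor(key.toString())).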

      Amareshwari Sriramadasu added a comment -

To achieve this, I think we could port MultipleOutputs, and change the semantics of getCollector() in the multi name case, so that the multi name is the full name of the output file. This method is typically invoked in the reduce() method, where the key and value are available, and can be used to form the name.

Tom, are you saying that we should not have a protected generateOutputName() method that could be overridden to provide this functionality? If so, we need a way to determine whether a given name is a namedOutput (I meant multiNamedOutputs) or an arbitrary name, so that we know which output format to use for writing.
We should have something like:

        public <K,V> void write(String namedOutput, String outputPath, K key, V value)
                throws IOException, InterruptedException;
        public <K,V> void write(String outputPath, K key, V value)
                throws IOException, InterruptedException;
      

      Applications that want to add a unique suffix can call FileOutputFormat#getUniqueFile() themselves.

      This should be done by the framework to support counters as explained earlier.

      Amareshwari Sriramadasu added a comment -

      Attaching an early patch.

      Patch does the following:
1. Adds an API to org.apache.hadoop.mapreduce.lib.output.FileOutputFormat that returns a RecordWriter for a given filename; the current API does not support passing a filename.

2. Adds org.apache.hadoop.mapreduce.lib.output.MultipleOutputs with the following API:

      public class MultipleOutputs<KEYOUT, VALUEOUT>  {
      
        public MultipleOutputs(TaskInputOutputContext context);
      
         // Adds a named output for the job.
        public static void addNamedOutput(Job job, String namedOutput,
            Class<? extends FileOutputFormat> outputFormatClass,
            Class<?> keyClass, Class<?> valueClass) ;
      
        // Enables counters for named outputs
        public static void setCountersEnabled(Job job, boolean enabled);
      
        // Write to a named output. 
        // write to an output file name that depends on key, value, context and namedoutput
        // gets the record writer from output format added for the named output 
        public <K,V> void write(String namedOutput, K key, V value)
                throws IOException, InterruptedException;
      
  // Writes to an output file name that depends on key, value and context;
  // gets the record writer from the job's output format.
  // The job's output format should be a FileOutputFormat.
        public  void write(KEYOUT key, VALUEOUT value) 
                throws IOException, InterruptedException;
      
  protected <K,V> String generateOutputName(K key, V value,
      TaskAttemptContext context, String name);

  protected <K,V> K generateActualKey(K key, V value);
  protected <K,V> V generateActualValue(K key, V value);
}

Users can add named outputs, with their corresponding OutputFormat and output key/value types, using addNamedOutput.
The generateOutputName API can be overridden by the user to supply the final output name, giving the user complete control over it. Once the user supplies this name, a unique file name can be generated (in the framework itself), as the patch does. This also supports the counter feature, which counts the number of records written to each output name. The same method can be used to plug in the functionality of multiNamedOutputs.

The added test case illustrates the use of this API.

      3. Deprecates org.apache.hadoop.mapred.lib.Multiple*Output*
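For illustration, a reducer using this API might look like the following sketch. The named output "errors", the key-prefix check, and the configure() helper are assumptions made for the example — they are not code from the patch or its test case, and the sketch requires a Hadoop runtime to execute:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch of a reducer writing some records to a named output
// ("errors" is an illustrative name) and the rest to the job's
// default output.
public class MultipleOutputsSketch
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  private MultipleOutputs<Text, LongWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, LongWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    if (key.toString().startsWith("ERROR")) {
      mos.write("errors", key, new LongWritable(sum)); // named output
    } else {
      context.write(key, new LongWritable(sum));       // default output
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close(); // close all named-output record writers
  }

  // Job setup: register the named output before submission.
  public static void configure(Job job) {
    MultipleOutputs.addNamedOutput(job, "errors",
        TextOutputFormat.class, Text.class, LongWritable.class);
    MultipleOutputs.setCountersEnabled(job, true);
  }
}
```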

      Tom White added a comment -

      The only feature that MultipleOutputs needs to make it at least as powerful as MultipleOutputFormat is the ability to control the output file name. At present the MultipleOutputs file name is

      <namedOutput>_<multiName>-(m|r)-<part-number>

      whereas in MultipleOutputFormat you have complete control over the naming, including the ability to create subdirectories by having a path separator (/) in the name.

To achieve this, I think we could port MultipleOutputs, and change the semantics of getCollector() in the multi name case, so that the multi name is the full name of the output file. This method is typically invoked in the reduce() method, where the key and value are available, and can be used to form the name. Applications that want to add a unique suffix can call FileOutputFormat#getUniqueFile() themselves.

      The single name case would work as before and create a single output file for a named output.
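The default naming scheme above can be sketched as a small standalone helper. buildDefaultName is hypothetical — it only mirrors the pattern; MultipleOutputs composes the name internally:

```java
// Standalone sketch of the default MultipleOutputs file-name pattern:
//   <namedOutput>_<multiName>-(m|r)-<part-number>
public class MultiOutputNameSketch {
    static String buildDefaultName(String namedOutput, String multiName,
                                   char taskType, int partition) {
        // taskType is 'm' for a map task, 'r' for a reduce task;
        // the part number is zero-padded to five digits, e.g. 0 -> "00000".
        return String.format("%s_%s-%c-%05d",
                             namedOutput, multiName, taskType, partition);
    }
}
```

With MultipleOutputFormat-style control, the multi-name portion (and any '/' it contains) would instead come from the application.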

      Amareshwari Sriramadasu added a comment -

I propose the following API for org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, combining org.apache.hadoop.mapred.lib.MultipleOutputs and org.apache.hadoop.mapred.lib.MultipleOutputFormat:

      public class MultipleOutputs<KEYOUT, VALUEOUT>  {
      
        public MultipleOutputs(TaskInputOutputContext context) {
        }
      
         // Adds a named output for the job.
        public static void addNamedOutput(Job job, String namedOutput,
            Class<? extends FileOutputFormat> outputFormatClass,
            Class<?> keyClass, Class<?> valueClass) ;
      
  // Adds a multi-named output for the job
        public static void addMultiNamedOutput(Job job, String namedOutput,
            Class<? extends FileOutputFormat> outputFormatClass,
            Class<?> keyClass, Class<?> valueClass);
      
        // Enables counters for named outputs
        public static void setCountersEnabled(Job job, boolean enabled);
      
        // Write to a named output.
        public <K,V> void write(String namedOutput, K key, V value)
                throws IOException, InterruptedException;
      
        // write to a multi-named output
        public <K,V> void write(String namedOutput, String multiName, K key, V value) 
               throws IOException, InterruptedException;
      
  // Writes to an output file name that depends on key, value and context;
  // gets the record writer from the job's output format.
  // The job's output format should be a FileOutputFormat; throws IOException if it is not.
        public  void write(KEYOUT key, VALUEOUT value) 
                throws IOException, InterruptedException;
      
        // close all outputs
        public void close() throws IOException, InterruptedException;
      
        /**
         * Generate the leaf name for the output file name. The default behavior does
         * not change the leaf file name (such as part-*-00000)
         */
        protected String generateLeafFileName(String name) {
          return name;
        }
      
  /**
   * Generate the output file name based on key, value, context, and the leaf
   * file name. The default behavior does not depend on key, value, or context.
   */
        protected String generateFileName(KEYOUT  key, VALUEOUT value,
            TaskAttemptContext context, String name) {
          return name;
        }
      
        /**
         * Generate the actual key from the given key/value. The default behavior is that
         * the actual key is equal to the given key
         * 
         * @param key
         *          the key of the output data
         * @param value
         *          the value of the output data
         * @return the actual key derived from the given key/value
         */
        protected <K,V> K generateActualKey(K key, V value) {
          return key;
        }
        
        /**
         * Generate the actual value from the given key and value. 
         * The default behavior is that
         * the actual value is equal to the given value
         * 
         * @param key
         *          the key of the output data
         * @param value
         *          the value of the output data
         * @return the actual value derived from the given key/value
         */
        protected <K,V> V generateActualValue(K key, V value) {
          return value;
        }
      }
      

      Thoughts?
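Against this proposed API (not a committed class), taking control of the output path could look like the sketch below; KeyedOutputs and the per-key subdirectory scheme are illustrative assumptions, and the sketch requires a Hadoop runtime:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: route each record into a per-key subdirectory by
// overriding generateFileName from the proposed API above.
public class KeyedOutputs extends MultipleOutputs<Text, IntWritable> {

  public KeyedOutputs(TaskInputOutputContext context) {
    super(context);
  }

  @Override
  protected String generateFileName(Text key, IntWritable value,
      TaskAttemptContext context, String name) {
    // A '/' in the name creates a subdirectory, as
    // MultipleOutputFormat allowed.
    return key.toString() + "/" + name;
  }
}
```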

      Tom White added a comment -

      It would be great if we could merge the functionality of MultipleOutputs and MultipleOutputFormat as a part of this migration, since having two is confusing for users.


        People

        • Assignee:
          Amareshwari Sriramadasu
          Reporter:
          Amareshwari Sriramadasu
• Votes:
  0
  Watchers:
  7

          Dates

          • Created:
            Updated:
            Resolved:

            Development