Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1065

Modify the mapred tutorial documentation to use new mapreduce api.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: documentation
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    1. MAPREDUCE-1065.3.patch
      138 kB
      Aaron Kimball
    2. MAPREDUCE-1065.2.patch
      136 kB
      Aaron Kimball
    3. MAPREDUCE-1065.patch
      129 kB
      Aaron Kimball

      Activity

      Hide
      Chris Douglas added a comment -

      +1 I committed this. Thanks, Aaron!

      Show
      Chris Douglas added a comment - +1 I committed this. Thanks, Aaron!
      Hide
      Hadoop QA added a comment -

      +1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12439921/MAPREDUCE-1065.3.patch
      against trunk revision 928104.

      +1 @author. The patch does not contain any @author tags.

      +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/console

      This message is automatically generated.

      Show
      Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12439921/MAPREDUCE-1065.3.patch against trunk revision 928104. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/64/console This message is automatically generated.
      Hide
      Aaron Kimball added a comment -

      Thanks for the thorough read-through, Chris. I've made the suggested changes.

      Show
      Aaron Kimball added a comment - Thanks for the thorough read-through, Chris. I've made the suggested changes.
      Hide
      Chris Douglas added a comment -

      Thanks for taking on this task. A few points that should be updated with this, but are not necessarily related to the new API follow. A full edit (making "how many

      {maps,reduces}

      " guidance more helpful, etc.) can be part of a separate issue, but it would be nice to correct flagrant errors.

      • This has not been true for some time (0.17?):

        The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

      • It should be noted that combine can run zero or more times in the "Inputs and Outputs" section (combine* may be sufficient)
      • Though probably obvious, it may be helpful to note what grouping will occur without a grouping comparator in the Mapper subsection
      • Removing "then" is more accurate in this sentence:

        The Mapper outputs are sorted and then partitioned per Reducer.

      • The link here:

        While some job parameters are straight-forward to set (e.g. setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (e.g. setNumMapTasks(int)).

        is broken (there is no Job::setNumMapTasks)

      • In the Map Parameters section, the reference to io.sort.buffer.spill.percent should be mapreduce.map.sort.spill.percent
      Show
      Chris Douglas added a comment - Thanks for taking on this task. A few points that should be updated with this, but are not necessarily related to the new API follow. A full edit (making "how many {maps,reduces} " guidance more helpful, etc.) can be part of a separate issue, but it would be nice to correct flagrant errors. This has not been true for some time (0.17?): The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. It should be noted that combine can run zero or more times in the "Inputs and Outputs" section ( combine* may be sufficient) Though probably obvious, it may be helpful to note what grouping will occur without a grouping comparator in the Mapper subsection Removing "then" is more accurate in this sentence: The Mapper outputs are sorted and then partitioned per Reducer. The link here: While some job parameters are straight-forward to set (e.g. setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration and are more complex to set (e.g. setNumMapTasks(int)). is broken (there is no Job::setNumMapTasks ) In the Map Parameters section, the reference to io.sort.buffer.spill.percent should be mapreduce.map.sort.spill.percent
      Hide
      Hadoop QA added a comment -

      +1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12437951/MAPREDUCE-1065.2.patch
      against trunk revision 919268.

      +1 @author. The patch does not contain any @author tags.

      +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      +1 core tests. The patch passed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/console

      This message is automatically generated.

      Show
      Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437951/MAPREDUCE-1065.2.patch against trunk revision 919268. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/501/console This message is automatically generated.
      Hide
      Aaron Kimball added a comment -

      Attaching patch that addresses the issues from the above review.

      Show
      Aaron Kimball added a comment - Attaching patch that addresses the issues from the above review.
      Hide
      Aaron Kimball added a comment -

      Amareshwari,

      Thanks for looking over the changes. I don't think that bin/hdfs dfs is correct though – Hadoop's filesystem operations are implementation-independent. hadoop fs -ls works just as well on a file:///-based host as an hdfs://-based host. As such, the behavior is still in bin/hadoop and not bin/hdfs (see the print_usage() function in each). The 'dfs' submodule has been deprecated in favor of the more generic 'fs' submodule for some time now. Similarly, the 'jar' submodule is still in bin/hadoop as running a Hadoop-aware program is not necessarily mapreduce-specific.

      I'll submit a new patch soon that addresses your other points.

      Show
      Aaron Kimball added a comment - Amareshwari, Thanks for looking over the changes. I don't think that bin/hdfs dfs is correct though – Hadoop's filesystem operations are implementation-independent. hadoop fs -ls works just as well on a file:/// -based host as an hdfs:// -based host. As such, the behavior is still in bin/hadoop and not bin/hdfs (see the print_usage() function in each). The 'dfs' submodule has been deprecated in favor of the more generic 'fs' submodule for some time now. Similarly, the 'jar' submodule is still in bin/hadoop as running a Hadoop-aware program is not necessarily mapreduce-specific. I'll submit a new patch soon that addresses your other points.
      Hide
      Amareshwari Sriramadasu added a comment -

      Some comments:

      • We should use scripts bin/hdfs for dfs commands, bin/mapred for MAPREDUCE commands, instead of bin/hadoop script.
        For ex:
        The following change is looks wrong:
        -          <code>$ bin/hadoop dfs -ls /usr/joe/wordcount/input/</code><br/>
        +          <code>$ bin/hadoop fs -ls /user/joe/wordcount/input/</code><br/>
        

        It should changed as

        -          <code>$ bin/hadoop dfs -ls /usr/joe/wordcount/input/</code><br/>
        +          <code>$ bin/hdfs dfs -ls /user/joe/wordcount/input/</code><br/>
        
      • Output part files are named part-m/r-<id> in new api.
        The following change should be modified to reflect the same.
        -            $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
        +            $ bin/hadoop fs -cat /user/joe/wordcount/output/part-00000
        

        It should be changed to

        -            $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
        +            $ bin/hdfs dfs -cat /user/joe/wordcount/output/part-r-00000
        
      • Minor - I guess we make sure lines are 80 characters long in documentation also.
      • Job.setNumMapTasks(int) is not linked to api doc.
      Show
      Amareshwari Sriramadasu added a comment - Some comments: We should use scripts bin/hdfs for dfs commands, bin/mapred for MAPREDUCE commands, instead of bin/hadoop script. For ex: The following change is looks wrong: - <code>$ bin/hadoop dfs -ls /usr/joe/wordcount/input/</code><br/> + <code>$ bin/hadoop fs -ls /user/joe/wordcount/input/</code><br/> It should changed as - <code>$ bin/hadoop dfs -ls /usr/joe/wordcount/input/</code><br/> + <code>$ bin/hdfs dfs -ls /user/joe/wordcount/input/</code><br/> Output part files are named part-m/r-<id> in new api. The following change should be modified to reflect the same. - $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 + $ bin/hadoop fs -cat /user/joe/wordcount/output/part-00000 It should be changed to - $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 + $ bin/hdfs dfs -cat /user/joe/wordcount/output/part-r-00000 Minor - I guess we make sure lines are 80 characters long in documentation also. Job.setNumMapTasks(int) is not linked to api doc.
      Hide
      Aaron Kimball added a comment -

      The test failure, of course, is unrelated

      Show
      Aaron Kimball added a comment - The test failure, of course, is unrelated
      Hide
      Hadoop QA added a comment -

      -1 overall. Here are the results of testing the latest attachment
      http://issues.apache.org/jira/secure/attachment/12437089/MAPREDUCE-1065.patch
      against trunk revision 916823.

      +1 @author. The patch does not contain any @author tags.

      +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

      +1 javadoc. The javadoc tool did not generate any warning messages.

      +1 javac. The applied patch does not increase the total number of javac compiler warnings.

      +1 findbugs. The patch does not introduce any new Findbugs warnings.

      +1 release audit. The applied patch does not increase the total number of release audit warnings.

      -1 core tests. The patch failed core unit tests.

      +1 contrib tests. The patch passed contrib unit tests.

      Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/testReport/
      Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
      Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/artifact/trunk/build/test/checkstyle-errors.html
      Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/console

      This message is automatically generated.

      Show
      Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12437089/MAPREDUCE-1065.patch against trunk revision 916823. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/492/console This message is automatically generated.
      Hide
      Aaron Kimball added a comment -

      Updated mapred_tutorial to use the new API throughout. Updated site.xml to point to javadoc for new entries as well. No automated tests because this is documentation. I clicked all the MapReduce API links to ensure that they head to the correct javadoc page. I also ran both the new wordcount programs (v1 and v2) on a local pseudodistributed instance to ensure they compile and run correctly as-is.

      Show
      Aaron Kimball added a comment - Updated mapred_tutorial to use the new API throughout. Updated site.xml to point to javadoc for new entries as well. No automated tests because this is documentation. I clicked all the MapReduce API links to ensure that they head to the correct javadoc page. I also ran both the new wordcount programs (v1 and v2) on a local pseudodistributed instance to ensure they compile and run correctly as-is.

        People

        • Assignee:
          Aaron Kimball
          Reporter:
          Amareshwari Sriramadasu
        • Votes:
          0 Vote for this issue
          Watchers:
          3 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development