Hadoop Map/Reduce
  MAPREDUCE-2410

document multiple keys per reducer oddity in hadoop streaming FAQ

    Details

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Add an FAQ entry regarding the differences between Java API and Streaming development of MR programs.
    • Tags:
      streaming

      Description

      Hi,
      for a newcomer to Hadoop Streaming, it comes as a surprise that the reducer receives arbitrary keys, unlike "real" Hadoop, where a reducer works on a single key.
      An explanation for this is at http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201103.mbox/browser

      I suggest adding this to the Hadoop Streaming FAQ.
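      The surprise described above can be sketched with a minimal Python streaming reducer (a hypothetical word-count example for illustration, not part of this issue's patch): because the framework feeds the entire sorted key/value stream over stdin, the script itself must detect where one key ends and the next begins.

```python
import sys
from itertools import groupby


def streaming_reduce(lines):
    # Parse pre-sorted "key<TAB>count" lines and sum the counts per key.
    # Unlike the Java API, a streaming reducer sees the whole sorted
    # stream and must detect key boundaries itself.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


if __name__ == "__main__":
    for key, total in streaming_reduce(sys.stdin):
        print(f"{key}\t{total}")
```

      Here `itertools.groupby` does the boundary detection that the Java API would otherwise perform before calling `reduce()` once per key.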

      1. MAPREDUCE-2410.r1.diff
        1.0 kB
        Harsh J
      2. MAPREDUCE-2410.r2.diff
        1 kB
        Harsh J
      3. MAPREDUCE-2410.r3.diff
        1 kB
        Harsh J

        Activity

        Harsh J added a comment -

        Dieter,

        I've attached a patch that adds a documentation entry to the streaming FAQ page.

        Let me know if the following is sufficient (it's what the patch contains as well):

        
        +<section>
        +<title>How does the use of streaming differ from the Java MapReduce API?</title>
        +<p>
        +    The Java MapReduce API provides a higher level API that lets the developer focus on writing map and reduce functions that act upon a pair of key and associated value(s). The Java API takes care of the iteration over the data source behind the scenes.
        +    In streaming, the framework pours in the input data over the stdin to the mapper/reduce program, and thus these programs ought to be written from the reading (via stdin) iteration level.
        +</p>
        +</section>
        
        
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478785/MAPREDUCE-2410.r1.diff
        against trunk revision 1101741.

        +1 @author. The patch does not contain any @author tags.

        +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/229//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/229//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/229//console

        This message is automatically generated.

        Dieter Plaetinck added a comment -

        I think that's very well and concisely explained, but to make it really clear to beginners I would add after the last line:
        "A practical consequence of this is that reducers for streaming need to be able to deal with different input keys"

        Or even:
        "A practical consequence of this is that reducers for streaming need to be able to deal with different input keys, although some projects exist to provide a similar abstract API on top of the streaming API, such as dumbo for python programmers [*]"

        [*] https://github.com/klbostee/dumbo/wiki/Short-tutorial
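        Dieter's point about abstraction layers can be illustrated with a small sketch (hypothetical helper names, not dumbo's actual API): a thin wrapper over the sorted stdin stream can recover a Java-style reduce(key, values) contract, so the reducer body no longer has to track key changes.

```python
import sys
from itertools import groupby


def run_reducer(reduce_fn, lines):
    # Emulate the Java API's per-key reduce(key, values) call on top of
    # the raw sorted "key<TAB>value" stream that streaming provides.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        reduce_fn(key, (value for _, value in group))


def sum_counts(key, values):
    # A Java-style reducer body: one key per call, plus its values.
    print(f"{key}\t{sum(int(v) for v in values)}")


if __name__ == "__main__":
    run_reducer(sum_counts, sys.stdin)
```

        This is the same idea that frameworks like dumbo implement more fully for Python programmers.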

        Harsh J added a comment -

        Incorporated the suggestion (#1) from Dieter.

        Would adding a link to an external site be a good idea from the ASF POV? I think it's better if that goes to the wiki pages instead. I may be being too paranoid, so let me know.

        Dieter Plaetinck added a comment -

        That's something I can't decide for you. I'm still fairly new to Hadoop (and the ASF); you are probably more aware of these kinds of policies and/or habits. Maybe it would be ideal to make a wiki page "frameworks that build on top of hadoop" or something, and link to that. (But I'm only aware of dumbo.)

        Harsh J added a comment -

        Added a wiki page link.

        Harsh J added a comment -

        Dieter - I've added some info here: http://wiki.apache.org/hadoop/HadoopStreaming/AlternativeInterfaces
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12478845/MAPREDUCE-2410.r3.diff
        against trunk revision 1101741.

        +1 @author. The patch does not contain any @author tags.

        +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/232//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/232//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/232//console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Committed to trunk and 0.22 (doc fix). Thanks, Harsh and Dieter!

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-22-branch #48 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-22-branch/48/)
        MAPREDUCE-2410. Add entry to streaming FAQ about how streaming reducers receive keys. Contributed by Harsh J Chouraria.

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #679 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk/679/)

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #675 (See https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/675/)


          People

          • Assignee:
            Harsh J
          • Reporter:
            Dieter Plaetinck
          • Votes:
            0
          • Watchers:
            0


              Time Tracking

              • Original Estimate: 40m
              • Remaining Estimate: 40m
              • Time Spent: Not Specified
