Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1772

Hadoop streaming doc should not use IdentityMapper as an example

    Details

    • Hadoop Flags:
      Reviewed

      Description

      From the URL http://hadoop.apache.org/core/docs/current/streaming.html

      This example doesn't work:

      You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

      $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -input myInputDirs \
      -output myOutputDir \
      -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
      -reducer /bin/wc

      This will produce the following exception:

      java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

      1. patch-1772-2.txt
        18 kB
        Amareshwari Sriramadasu
      2. patch-1772-1.txt
        17 kB
        Amareshwari Sriramadasu
      3. patch-1772.txt
        15 kB
        Amareshwari Sriramadasu

        Activity

        Hide
        Amareshwari Sriramadasu added a comment -

        Karam reported some more issues with streaming documentation. Posting them here on his behalf. The issues are :

        1. In many examples, system commands such /bin/cat or /bin/wc are used. Can we change them to 'cat' or 'wc' instead of prefixing them with '/bin', as sometimes utilities do not reside in /bin? Especially command such as /bin/wc is not there in /bin, it is under /usr/bin/wc.
        2. In some places, for usage we are using "bin/hadoop command [genericOptions] [streamingOptions]". it should be changed to "$HADOOP_HOME/bin/hadoop command [genericOptions] [streamingOptions]"
        3. There are places where we set -D mapred.reduce.tasks=0 where there is -numReduceTasks is also for the same. From the user's view, it will be good to replace -D mapred.reduce.tasks=0 to -numReduceTasks 0 or In usage we should mention that -numReduceTasks and -D mapred.reduce.tasks are the same and either or of them can be used to specify reduces. Also we need to add note as -numReduceTasks appears in streaming option after generic options it will over-ride -Dmapred.reduce.tasks if both are specified in command.
        4. In the examples for KeyFieldBasedPartitioner and KeyFieldBasedComparator, we should change mapper from IdentityMapper to cat, because in case of IdentityMapper map output key will be written and is LongWritable, so the result is not same as expected.
        5. c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
              -D mapred.job.name='Experiment'
              -input /user/me/samples/student_marks 
              -output /user/me/samples/student_out 
              -mapper \"$c2\" -reducer 'cat' 
          

          In the above example, -mapper \"$c2\" should be written as "$c2" otherwise streaming interprets -f2 as invalid option and fails.

        Show
        Amareshwari Sriramadasu added a comment - Karam reported some more issues with streaming documentation. Posting them here on his behalf. The issues are : In many examples, system commands such /bin/cat or /bin/wc are used. Can we change them to 'cat' or 'wc' instead of prefixing them with '/bin', as sometimes utilities do not reside in /bin? Especially command such as /bin/wc is not there in /bin, it is under /usr/bin/wc. In some places, for usage we are using "bin/hadoop command [genericOptions] [streamingOptions] ". it should be changed to "$HADOOP_HOME/bin/hadoop command [genericOptions] [streamingOptions] " There are places where we set -D mapred.reduce.tasks=0 where there is -numReduceTasks is also for the same. From the user's view, it will be good to replace -D mapred.reduce.tasks=0 to -numReduceTasks 0 or In usage we should mention that -numReduceTasks and -D mapred.reduce.tasks are the same and either or of them can be used to specify reduces. Also we need to add note as -numReduceTasks appears in streaming option after generic options it will over-ride -Dmapred.reduce.tasks if both are specified in command. In the examples for KeyFieldBasedPartitioner and KeyFieldBasedComparator, we should change mapper from IdentityMapper to cat, because in case of IdentityMapper map output key will be written and is LongWritable, so the result is not same as expected. c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D mapred.job.name='Experiment' -input /user/me/samples/student_marks -output /user/me/samples/student_out -mapper \ "$c2\" -reducer 'cat' In the above example, -mapper \"$c2\" should be written as "$c2" otherwise streaming interprets -f2 as invalid option and fails.
        Hide
        Amareshwari Sriramadasu added a comment -

        Attached patch corrects the documentation.

        Show
        Amareshwari Sriramadasu added a comment - Attached patch corrects the documentation.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12448966/patch-1772.txt
        against trunk revision 961578.

        +1 @author. The patch does not contain any @author tags.

        +0 tests included. The patch appears to be a documentation patch that doesn't require tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448966/patch-1772.txt against trunk revision 961578. +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/594/console This message is automatically generated.
        Hide
        Ravi Gummadi added a comment -

        In documentation, many occurences of "$HADOOP_HOME/hadoop-streaming.jar" is seen. Would you please modify it to hadoop-streaming.jar sothat we don't mention wrong path of the jar ?

        Show
        Ravi Gummadi added a comment - In documentation, many occurences of "$HADOOP_HOME/hadoop-streaming.jar" is seen. Would you please modify it to hadoop-streaming.jar sothat we don't mention wrong path of the jar ?
        Hide
        Amareshwari Sriramadasu added a comment -

        Replaced occurrences of "$HADOOP_HOME/hadoop-streaming.jar" with hadoop-streaming.jar.

        Ran ant docs successfully with the patch.

        Show
        Amareshwari Sriramadasu added a comment - Replaced occurrences of "$HADOOP_HOME/hadoop-streaming.jar" with hadoop-streaming.jar. Ran ant docs successfully with the patch.
        Hide
        Ravi Gummadi added a comment -

        Patch looks good.
        +1

        Show
        Ravi Gummadi added a comment - Patch looks good. +1
        Hide
        Amareshwari Sriramadasu added a comment -

        One minor correction in the doc: removed '\' in the command ending with " -file myDictionary.txt \"

        Show
        Amareshwari Sriramadasu added a comment - One minor correction in the doc: removed '\' in the command ending with " -file myDictionary.txt \"
        Hide
        Amareshwari Sriramadasu added a comment -

        I just committed this.

        Show
        Amareshwari Sriramadasu added a comment - I just committed this.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/ )

          People

          • Assignee:
            Amareshwari Sriramadasu
            Reporter:
            Marco Nicosia
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development