Hadoop Common / HADOOP-4842

Streaming combiner should allow command, not just JavaClass

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Streaming option -combiner allows any streaming command (not just Java class) to be a combiner.

      Description

      Streaming jobs are way slower than Java jobs for many reasons, but certainly stopping the shell-only programmer from using the combiner feature won't help. Right now, the streaming usage says:

      -mapper <cmd|JavaClassName> The streaming command to run
      -combiner <JavaClassName> Combiner has to be a Java class
      -reducer <cmd|JavaClassName> The streaming command to run
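To make the request concrete, here is a minimal sketch of what a script-based combiner could look like for a word count, written in Python. It assumes the streaming default of tab-separated key/value text lines, and that the combiner sees its input grouped by key the way a streaming reducer does. The function name `combine_counts` and the sample data are illustrative and not part of Hadoop.

```python
from itertools import groupby


def combine_counts(lines):
    """Sum counts per word for key-sorted 'word<TAB>count' lines.

    The output uses the same 'word<TAB>count' layout as the input, so it
    can be fed to a reducer (or to another combiner pass) unchanged.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Illustrative map output for one map task, already sorted by key.
sample = ["apple\t1\n", "apple\t1\n", "pear\t1\n", "plum\t1\n", "plum\t1\n"]
combined = list(combine_counts(sample))
print("\n".join(combined))

# In an actual streaming script the input would come from standard input:
#   for line in combine_counts(sys.stdin): print(line)
```

Five input records shrink to three here; on real data the saving depends on how often keys repeat within a single map task.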

      1. patch-4842.txt
        6 kB
        Amareshwari Sriramadasu
      2. patch-4842-1.txt
        6 kB
        Amareshwari Sriramadasu
      3. patch-4842-2.txt
        8 kB
        Amareshwari Sriramadasu
      4. patch-4842-3.txt
        7 kB
        Amareshwari Sriramadasu

        Activity

        Klaas Bosteels added a comment -

        Actually, shell-only programmers can already combine by adding something like "| sort | sh combiner.sh" to their mapper script. More generally, I think it makes more sense to combine locally in the streaming application process itself, instead of running an additional application process and requiring another round trip to the Java process and back. Both Pipes and Dumbo use this approach for combining.
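The local combining described in this comment can be sketched in Python as a mapper that aggregates in memory before emitting anything. This only illustrates the idea and is not how Pipes or Dumbo implement it; the function name and sample input are made up, and it assumes the distinct words seen by one map task fit in memory.

```python
from collections import Counter


def map_with_local_combine(lines):
    """Word-count mapper that pre-aggregates counts inside the mapper process."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Emit one 'word<TAB>count' record per distinct word
    # instead of one record per occurrence.
    for word in sorted(counts):
        yield f"{word}\t{counts[word]}"


records = list(map_with_local_combine(["a b a\n", "b a c\n"]))
print(records)
```

The trade-off is memory held by the mapper versus the extra process and round trip that a separate combiner command needs.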

        Amareshwari Sriramadasu added a comment -

        Owen's idea of using PipeReducer for Combiner worked fine.
        Attaching patch with -combiner option accepting any streaming command.

        Patch does the following:

        • Added a PipeCombiner class, which extends PipeReducer and overrides getPipeCommand(JobConf) to return the combiner command.
        • If the -combiner option is not a Java class, PipeCombiner.class is set as the combiner class and the command is passed in the configuration through the property "stream.combine.streamprocessor".
        • Updated the documentation for the change.
        • Added a test to TestStreaming that runs with a combiner.
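The option handling described in the list above can be modelled roughly as follows. This is a Python sketch of the decision only, not the patch's Java code: the dictionary stands in for the job configuration, `combiner.class` is a placeholder key, and the caller-supplied `is_java_class` predicate stands in for whatever check streaming actually performs. Only the property name `stream.combine.streamprocessor` and the class name PipeCombiner come from the comment.

```python
def configure_combiner(conf, combiner_arg, is_java_class):
    """Record the combiner choice in conf (a plain dict standing in for JobConf)."""
    if is_java_class(combiner_arg):
        # A Java class name is used directly as the combiner class.
        conf["combiner.class"] = combiner_arg  # placeholder key
    else:
        # Any other value is treated as a command: PipeCombiner runs it,
        # and the command itself travels in the configuration.
        conf["combiner.class"] = "PipeCombiner"
        conf["stream.combine.streamprocessor"] = combiner_arg
    return conf


# Hypothetical predicate for the demo; the real check is not shown in this issue.
known_classes = {"org.example.MyCombiner"}
java_conf = configure_combiner({}, "org.example.MyCombiner", known_classes.__contains__)
cmd_conf = configure_combiner({}, "sh combiner.sh", known_classes.__contains__)
print(java_conf)
print(cmd_conf)
```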
        Giridharan Kesavan added a comment -

        Hudson couldn't update the logs because JIRA was down, so I'm copying the results from the Hudson logs.

        [exec] -1 overall. Here are the results of testing the latest attachment
        [exec] http://issues.apache.org/jira/secure/attachment/12401013/patch-4842.txt
        [exec] against trunk revision 748381.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        [exec]
        [exec] +1 core tests. The patch passed core unit tests.
        [exec]
        [exec] -1 contrib tests. The patch failed contrib unit tests.
        [exec]
        [exec] Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/testReport/
        [exec] Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        [exec] Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/artifact/trunk/build/test/checkstyle-errors.html
        [exec] Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/console

        Amareshwari Sriramadasu added a comment -

        The test failure was caused by the combiner test living inside TestStreaming. I moved it into a separate class.
        All contrib tests passed on my machine.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12401102/patch-4842-1.txt
        against trunk revision 748861.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        The test failure in org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestStartAtOffset.testStartAtOffset is not related to the patch.

        Amareshwari Sriramadasu added a comment -

        This patch changes the test case to do a word count in which the combiner does the aggregation and the reducer is the identity. It also validates the combiner counters in the test.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12402126/patch-4842-2.txt
        against trunk revision 753113.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        Canceling the patch since the test is failing on Windows. Using a shell script in the test fails on Windows with "CreateProcess error=193, %1 is not a valid Win32 application".

        Amareshwari Sriramadasu added a comment -

        Patch with earlier combiner test, with combiner counters validated in the test.

        Amareshwari Sriramadasu added a comment -

        On a 10-node cluster, I ran a job (finding unique words in the input) with and without the combiner; the runtimes were 1 min 40 sec and 5 min 58 sec, respectively.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12402355/patch-4842-3.txt
        against trunk revision 755057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/console

        This message is automatically generated.

        Sharad Agarwal added a comment -

        +1

        Devaraj Das added a comment -

        I just committed this. Thanks, Amareshwari!

        Hudson added a comment -

        Integrated in Hadoop-trunk #785 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/785/ )

        Robert Chansler added a comment -

        Editorial pass over all release notes prior to publication of 0.21.

        Antonio Piccolboni added a comment -

        I entered a comment on HADOOP-1722 that may be of interest here too. The problem seems to be that binary formats and streaming combiners don't work well together, particularly if one wants the reducer to read typedbytes and write text. If the combiner does the same, then the combiner writes text while the reducer expects typedbytes. I'm trying to understand what the expected behavior is before I submit a bug.
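The mismatch described here follows from a general property of combiners: the framework may run a combiner zero or more times, so its output has to be in the same format as its input, which is also what the reducer reads. The Python sketch below illustrates that property with plain tuples. It does not model typedbytes or text serialization, and all names are illustrative.

```python
from itertools import groupby


def combine(pairs):
    """Combiner: (word, count) pairs in, (word, count) pairs out."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))


def reduce_to_text(pairs):
    """Reducer: reads (word, count) pairs but writes text lines."""
    return [f"{word}\t{total}" for word, total in combine(pairs)]


map_output = [("b", 1), ("a", 1), ("b", 1)]

# Because combine() preserves the record format, the reducer's result is the
# same whether the combiner ran zero, one, or two times.
no_pass = reduce_to_text(map_output)
one_pass = reduce_to_text(list(combine(map_output)))
two_pass = reduce_to_text(list(combine(combine(map_output))))
print(no_pass == one_pass == two_pass)
```

If the combiner instead emitted text lines the way `reduce_to_text` does, a second combiner pass or the reducer would receive records in a format it does not expect, which is the situation the comment describes.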


          People

          • Assignee:
            Amareshwari Sriramadasu
            Reporter:
            Marco Nicosia
          • Votes:
            0
            Watchers:
            3
