Hadoop Common / HADOOP-4842

Streaming combiner should allow command, not just JavaClass

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Streaming option -combiner allows any streaming command (not just Java class) to be a combiner.

      Description

      Streaming jobs are much slower than Java jobs for many reasons, and preventing shell-only programmers from using the combiner feature certainly doesn't help. Right now, the streaming usage says:

      -mapper <cmd|JavaClassName> The streaming command to run
      -combiner <JavaClassName> Combiner has to be a Java class
      -reducer <cmd|JavaClassName> The streaming command to run
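For illustration, the kind of invocation this issue asks for could be assembled as below. This is only a sketch: the jar path, the input and output paths, and the script names are hypothetical placeholders, and passing a command to -combiner works only once the requested change is in place.

```python
import shlex


def streaming_args(jar, input_path, output_path, mapper, combiner, reducer):
    """Build the argument list for a streaming job that uses a command combiner."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-combiner", combiner,  # a command here, not a Java class name
        "-reducer", reducer,
    ]


# Every path and script name below is a hypothetical placeholder.
args = streaming_args(
    jar="hadoop-streaming.jar",
    input_path="in",
    output_path="out",
    mapper="sh mapper.sh",
    combiner="sh combiner.sh",
    reducer="sh reducer.sh",
)

# Print a shell-safe rendering of the command line.
print(" ".join(shlex.quote(a) for a in args))
```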

      1. patch-4842-3.txt
        7 kB
        Amareshwari Sriramadasu
      2. patch-4842-2.txt
        8 kB
        Amareshwari Sriramadasu
      3. patch-4842-1.txt
        6 kB
        Amareshwari Sriramadasu
      4. patch-4842.txt
        6 kB
        Amareshwari Sriramadasu

        Activity

        Marco Nicosia created issue -
        Amareshwari Sriramadasu made changes -
        Field Original Value New Value
        Assignee Amareshwari Sriramadasu [ amareshwari ]
        Fix Version/s 0.21.0 [ 12313563 ]
        Klaas Bosteels added a comment -

        Actually, shell-only programmers can already combine by adding something like "| sort | sh combiner.sh" to their mapper script. More generally, I think it makes more sense to combine locally in the streaming application process itself, instead of running an additional application process and requiring another round trip to the Java process and back. Both Pipes and Dumbo use this approach for combining.
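The idea of combining inside the streaming process itself can be sketched roughly as follows for a word count. This is an illustrative example, not code from Pipes or Dumbo; the function name and the "word<TAB>count" record layout are assumptions made for the sketch.

```python
from collections import Counter


def map_with_local_combine(lines):
    """Tokenize the input and pre-aggregate counts inside the mapper process."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Emit one "word<TAB>count" record per distinct word,
    # instead of one record per occurrence.
    return [f"{word}\t{n}" for word, n in sorted(counts.items())]


# In a real streaming mapper the lines would come from standard input;
# a small in-memory sample keeps the sketch self-contained.
for record in map_with_local_combine(["a b a", "b a"]):
    print(record)
```

Note that this variant holds every distinct key in memory until the input is exhausted, which is one trade-off against running a separate combiner process.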

        Amareshwari Sriramadasu added a comment -

        Owen's idea of using PipeReducer for Combiner worked fine.
        Attaching patch with -combiner option accepting any streaming command.

        Patch does the following:

        • Added PipeCombiner class which extends PipeReducer and overrides method getPipeCommand(JobConf) to return combiner command.
        • If -combiner option is not a java class, PipeCombiner.class is set as the Combiner class and the command is passed in configuration through property "stream.combine.streamprocessor"
        • Modified documentation for the change
        • Added a test to TestStreaming to run with a combiner.
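The option handling described in the list above can be modelled in a few lines. This is a Python model of the decision only, not the Java code from the patch: the `loadable_classes` set stands in for the JVM's ability to load a class by name, and the `combiner.class` key is a hypothetical placeholder. Only the property name "stream.combine.streamprocessor" and the class name PipeCombiner come from the comment.

```python
def configure_combiner(conf, combiner_arg, loadable_classes):
    """Model of the -combiner handling: Java class if loadable, else a command.

    conf             -- dict standing in for the job configuration
    combiner_arg     -- the value given to -combiner
    loadable_classes -- hypothetical stand-in for "this name resolves to a Java class"
    """
    if combiner_arg in loadable_classes:
        # The argument names a Java class: use it directly as the combiner.
        conf["combiner.class"] = combiner_arg  # hypothetical key name
    else:
        # Otherwise treat it as a command and let PipeCombiner run it.
        conf["combiner.class"] = "PipeCombiner"
        conf["stream.combine.streamprocessor"] = combiner_arg
    return conf


known = {"org.example.MyCombiner"}  # hypothetical class name
print(configure_combiner({}, "org.example.MyCombiner", known))
print(configure_combiner({}, "sh combiner.sh", known))
```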
        Amareshwari Sriramadasu made changes -
        Attachment patch-4842.txt [ 12401013 ]
        Amareshwari Sriramadasu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Giridharan Kesavan added a comment -

        Hudson couldn't update the logs because JIRA was down, so I'm copying them from the Hudson logs.

        [exec] -1 overall. Here are the results of testing the latest attachment
        [exec] http://issues.apache.org/jira/secure/attachment/12401013/patch-4842.txt
        [exec] against trunk revision 748381.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        [exec]
        [exec] +1 core tests. The patch passed core unit tests.
        [exec]
        [exec] -1 contrib tests. The patch failed contrib unit tests.
        [exec]
        [exec] Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/testReport/
        [exec] Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        [exec] Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/artifact/trunk/build/test/checkstyle-errors.html
        [exec] Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta/8/console

        Amareshwari Sriramadasu made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Amareshwari Sriramadasu added a comment -

        The test failure was caused by TestStreaming containing the combiner test. I moved it into a separate class.
        All contrib tests passed on my machine.

        Amareshwari Sriramadasu made changes -
        Attachment patch-4842-1.txt [ 12401102 ]
        Amareshwari Sriramadasu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12401102/patch-4842-1.txt
        against trunk revision 748861.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/24/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        Test failure org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestStartAtOffset.testStartAtOffset is not related to the patch

        Amareshwari Sriramadasu made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Amareshwari Sriramadasu added a comment -

        Patch changing the test case to do a word count in which the combiner does the aggregation and the reducer is the identity. Also validated the combiner counters in the test.
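A streaming combiner for such a word count might look like the sketch below. It is not the script used in the patch's test; the function name and the "word<TAB>count" record layout are assumptions. It relies on the combiner input arriving sorted by key, and it writes records in the same layout it reads, since a combiner may be applied zero, one, or several times.

```python
from itertools import groupby


def combine_counts(sorted_lines):
    """Sum the counts for each run of identical keys in sorted 'word<TAB>count' lines."""
    def parse(line):
        word, count = line.rstrip("\n").split("\t", 1)
        return word, int(count)

    # Skip blank lines, then group consecutive records that share a key.
    records = (parse(line) for line in sorted_lines if line.strip())
    for word, group in groupby(records, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(n for _, n in group)}"


# A real combiner would iterate over standard input; a small sample is used here.
sample = ["apple\t1\n", "apple\t1\n", "pear\t1\n"]
for record in combine_counts(sample):
    print(record)
```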

        Amareshwari Sriramadasu made changes -
        Attachment patch-4842-2.txt [ 12402126 ]
        Amareshwari Sriramadasu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12402126/patch-4842-2.txt
        against trunk revision 753113.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/61/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        Canceling patch since the test is failing on Windows. Using a shell script in the test fails on Windows with "CreateProcess error=193, %1 is not a valid Win32 application".

        Amareshwari Sriramadasu made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Amareshwari Sriramadasu added a comment -

        Patch with earlier combiner test, with combiner counters validated in the test.

        Amareshwari Sriramadasu made changes -
        Attachment patch-4842-3.txt [ 12402355 ]
        Amareshwari Sriramadasu made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Amareshwari Sriramadasu added a comment -

        On a 10-node cluster, I ran a job (finding unique words in the input) with and without the combiner; the runtimes were 1 min 40 sec and 5 min 58 sec, respectively.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12402355/patch-4842-3.txt
        against trunk revision 755057.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/95/console

        This message is automatically generated.

        Sharad Agarwal added a comment -

        +1

        Devaraj Das added a comment -

        I just committed this. Thanks, Amareshwari!

        Devaraj Das made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Amareshwari Sriramadasu made changes -
        Release Note -combiner option in steaming allows any streaming command to be a combiner.
        Hudson added a comment -

        Integrated in Hadoop-trunk #785 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/785/ )

        Owen O'Malley made changes -
        Component/s contrib/streaming [ 12310972 ]
        Robert Chansler added a comment -

        Editorial pass over all release notes prior to publication of 0.21.

        Robert Chansler made changes -
        Release Note -combiner option in steaming allows any streaming command to be a combiner. Streaming option -combiner allows any streaming command (not just Java class) to be a combiner.
        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Antonio Piccolboni added a comment -

        I entered a comment on HADOOP-1722 that may be of interest here too. The problem seems to be that binary formats and streaming combiners don't work well together, particularly if one wants the reducer to read typedbytes and write text. If the combiner does the same, then the combiner writes text while the reducer expects typedbytes. I'm trying to understand what the expected behavior is before I submit a bug.
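The mismatch described here can be reproduced with a toy model. The binary layout below is invented for the sketch and is not the typedbytes format; the point is only that a combiner which reads one format and writes another hands the reducer input it cannot parse.

```python
import struct


def write_binary(key, value):
    """Toy binary record: 4-byte big-endian key length, UTF-8 key, 8-byte integer."""
    raw = key.encode("utf-8")
    return struct.pack(">I", len(raw)) + raw + struct.pack(">q", value)


def read_binary(record):
    """Parse a toy binary record back into (key, value)."""
    (n,) = struct.unpack_from(">I", record, 0)
    key = record[4:4 + n].decode("utf-8")
    (value,) = struct.unpack_from(">q", record, 4 + n)
    return key, value


def write_text(key, value):
    """Text record: key, tab, value, newline."""
    return f"{key}\t{value}\n".encode("utf-8")


# A combiner configured like the reducer: it reads binary and writes text.
combiner_output = write_text(*read_binary(write_binary("apple", 2)))

# The reducer still expects binary input, so parsing the text record fails.
try:
    read_binary(combiner_output)
    mismatch = False
except (struct.error, UnicodeDecodeError):
    mismatch = True

print("reducer could not parse combiner output:", mismatch)
```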

        Transition                    Time In Source Status   Execution Times   Last Executer             Last Execution Date
        Patch Available -> Open       18d 15h 35m             3                 Amareshwari Sriramadasu   17/Mar/09 06:57
        Open -> Patch Available       76d 16h 17m             4                 Amareshwari Sriramadasu   17/Mar/09 07:00
        Patch Available -> Resolved   2d 4h 20m               1                 Devaraj Das               19/Mar/09 11:20
        Resolved -> Closed            523d 9h 13m             1                 Tom White                 24/Aug/10 20:34

          People

          • Assignee:
            Amareshwari Sriramadasu
            Reporter:
            Marco Nicosia
          • Votes:
            0
            Watchers:
            3
