Pig / PIG-750

Use combiner when algebraic UDFs are used in expressions

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      With the changes in this patch, queries that use algebraic functions within expressions will also use the combiner, as long as the bags produced by the group-by are input only to algebraic functions. If a bag is projected, or a non-algebraic expression or UDF takes a bag as input, the combiner will not be used.
      The combiner will be used for the following foreach statements (that follow a group):
      describe B ;
      B: {group: int, A: {c1 : int, c2 : int, c3 : int}}

      1) foreach B generate SUM(A.c2) * AVG(A.c3), ...
      2) foreach B generate 1 / SUM(A.c2)
      3) foreach B generate EXP(AVG(A.c2))
      4) foreach B generate group + SUM(A.c2)


      The following statements will not use the combiner:
      1) foreach B generate A.c2, ...
      2) foreach B generate EXP(c2), SUM(c2) ... (where EXP is a non-algebraic function)

      In a nested foreach statement that contains limit, order, or filter, the combiner is not used (as before).

      This patch also fixes PIG-490: foreach statements that access elements of the group key also use the combiner, for example:
      1) foreach B generate group.$0, group.$1, COUNT(A);
      2) foreach B generate group.c1, group.c2, COUNT(A);
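
      The rules above can be tried out with a small script. This is a minimal sketch, not taken from the patch: the input file name, the load schema, and the aliases C1, C2, and C3 are assumptions chosen to match the schema of B shown earlier.

      ```pig
      -- Hypothetical input: a tab-separated file with three integer columns.
      A = LOAD 'input.txt' AS (c1:int, c2:int, c3:int);
      B = GROUP A BY c1;
      DESCRIBE B;

      -- The bag A feeds only algebraic functions (SUM, AVG),
      -- so the combiner is expected to be used.
      C1 = FOREACH B GENERATE group, SUM(A.c2) * AVG(A.c3);
      C2 = FOREACH B GENERATE group + SUM(A.c2);

      -- The bag A.c2 is projected directly,
      -- so the combiner is not expected to be used.
      C3 = FOREACH B GENERATE group, A.c2;

      -- Inspect the generated plans to compare the two cases.
      EXPLAIN C1;
      EXPLAIN C3;
      ```

      Comparing the map-reduce plans printed by EXPLAIN is one way to check whether a combine step was generated for a given alias; the exact output format depends on the Pig version.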

      Description

      Currently, Pig uses the combiner only when all of a, b, c, ... are algebraic (e.g. SUM, AVG) in a foreach:

      foreach X generate a,b,c,...

      It would be a performance improvement to use the combiner when a mix of algebraic and non-algebraic functions is used as well.

      1. PIG-750.1.patch
        67 kB
        Thejas M Nair

        Activity

        Scott Carey added a comment -

        Awesome!

        This is a huge improvement for several of my use cases.

        Thejas M Nair added a comment -

        Patch committed to trunk.

        Daniel Dai added a comment -

        +1. Please put release notes.

        Thejas M Nair added a comment -

        PIG-750.1.patch: this patch also fixes PIG-490.
        In the test-patch results, the release audit warning is a false alarm; it is complaining about a generated docs diff file.
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 501 release audit warnings (more than the trunk's current 500 warnings).

        I have run the unit tests and they all pass, but I am running them again after some cosmetic changes.

        Alan Gates added a comment -

        Our performance tests have shown that having combiner and non-combiner functions in the same MR job actually severely slows things down. We suspect that this is because you have to pass the bags for the non-combiner functions through the combiner, and you pay for the multiple (de)serialization passes.

        However, the other points noted in this bug, such as the need to use the combiner when algebraic UDFs are involved in simple expressions, are valid and are along the lines of the issues Thejas is working on for the combiner. So I'm assigning the issue to him.

        David Ciemiewicz added a comment -

        Also consider the application of a scalar function to the result of an aggregation function:

        3) foreach X generate EXP(AVG(b))

        Amir Youssefi added a comment -

        Other use cases we need to have in unit tests:

        1) foreach X generate SUM(a) * AVG(b), ...

        2) foreach X generate 1 / SUM(a)

        Currently, the suggested workaround is to calculate all algebraic functions in one foreach and then calculate the remaining expressions and mixes in a second foreach. This way the combiner is used in the first foreach and we get the combiner speed-up.
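
        The two-foreach workaround could look roughly like the following. This is a sketch only: the input file, the field names k, a, and b, and the aliases raw, Y, and Z are hypothetical and do not come from the issue.

        ```pig
        -- Hypothetical input: a grouping key k and two integer columns a and b.
        raw = LOAD 'input.txt' AS (k:int, a:int, b:int);
        X = GROUP raw BY k;

        -- First foreach: only algebraic functions, so the combiner can be used.
        Y = FOREACH X GENERATE group, SUM(raw.a) AS sum_a, AVG(raw.b) AS avg_b;

        -- Second foreach: arbitrary expressions over the scalar results.
        Z = FOREACH Y GENERATE group, sum_a * avg_b, 1.0 / (double)sum_a;
        ```

        The second foreach works only on the per-group scalars produced by the first, so no bags are involved in the mixed expressions.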


          People

          • Assignee:
            Thejas M Nair
            Reporter:
            Amir Youssefi
          • Votes:
            1
            Watchers:
            2
