Pig / PIG-750

Use combiner when algebraic UDFs are used in expressions

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      With the changes in this patch, queries that use algebraic functions within expressions will also use the combiner, as long as the bags produced by group-by are used only as input to algebraic functions. If a bag is projected, or a non-algebraic expression or UDF takes a bag as input, the combiner will not be used.
      The combiner will be used for the following foreach statements (that follow a group):
      describe B ;
      B: {group: int, A: {c1 : int, c2 : int, c3 : int}}

      1) foreach B generate SUM(A.c2) * AVG(A.c3), ...
      2) foreach B generate 1 / SUM(A.c2)
      3) foreach B generate EXP(AVG(A.c2))
      4) foreach B generate group + SUM(A.c2)
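
The first statement above can be illustrated with a small simulation. This is only a sketch in Python, not Pig's implementation: the names `initial`, `merge`, `final`, `run_with_combiner` and `run_without_combiner` are hypothetical, the sample rows are made up, and null handling is ignored. It shows why `SUM(A.c2) * AVG(A.c3)` can be pre-aggregated: both functions reduce to partial sums and counts that merge by addition, and the outer multiplication is applied only once on the reduce side.

```python
from collections import defaultdict

# Made-up rows of A as (c1, c2, c3); grouping by c1 gives relation B.
rows = [(1, 2, 10), (1, 4, 20), (2, 6, 30), (1, 6, 30), (2, 2, 50)]


def initial(row):
    """Map side: turn one row into a partial state (sum_c2, sum_c3, count)."""
    _, c2, c3 = row
    return (c2, c3, 1)


def merge(a, b):
    """Combiner and reducer: partial states merge by component-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def final(state):
    """Reduce side: evaluate SUM(A.c2) * AVG(A.c3) from the merged state."""
    sum_c2, sum_c3, count = state
    return sum_c2 * (sum_c3 / count)


def run_with_combiner(rows, num_splits=2):
    # Each split stands in for one map task; its combiner pre-aggregates per key.
    splits = [rows[i::num_splits] for i in range(num_splits)]
    combined = []
    for split in splits:
        partial = {}
        for row in split:
            key, state = row[0], initial(row)
            partial[key] = merge(partial[key], state) if key in partial else state
        combined.append(partial)
    # Reduce: merge the per-split partial states, then apply the outer expression.
    merged = {}
    for partial in combined:
        for key, state in partial.items():
            merged[key] = merge(merged[key], state) if key in merged else state
    return {key: final(state) for key, state in merged.items()}


def run_without_combiner(rows):
    # Baseline: ship every row to the reducer and aggregate the whole bag there.
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    return {
        key: sum(r[1] for r in bag) * (sum(r[2] for r in bag) / len(bag))
        for key, bag in bags.items()
    }
```

Both functions return the same per-group values; the difference is that the combiner variant sends one small tuple per key and split to the reducer instead of every row.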


      The following statements will not use the combiner:
      1) foreach B generate A.c2, ...
      2) foreach B generate EXP(c2), SUM(c2) ... (where EXP is a non-algebraic function)

      In a nested foreach statement that contains limit, order, or filter, the combiner is not used (as before).
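
The bag-input rule described above can be sketched as a check over a toy expression tree. This is a hypothetical model in Python, not the code in the patch: the node classes, the `ALGEBRAIC` set and the function names are assumptions made for illustration, and the nested foreach case (limit, order, filter) is not modeled. The idea is that a bag reference is acceptable only as a direct argument of an algebraic function.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical set of functions treated as algebraic in this sketch.
ALGEBRAIC = {"SUM", "AVG", "COUNT", "MIN", "MAX"}


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Group:
    """Reference to the group key (or one of its elements)."""


@dataclass(frozen=True)
class Bag:
    """Reference to the grouped bag, e.g. A or A.c2."""
    column: str = ""


@dataclass(frozen=True)
class Func:
    name: str
    args: Tuple[object, ...]


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


def bag_use_ok(expr, under_algebraic=False):
    """True if every bag reference is a direct argument of an algebraic function."""
    if isinstance(expr, Bag):
        return under_algebraic
    if isinstance(expr, Func):
        direct = expr.name in ALGEBRAIC
        return all(bag_use_ok(arg, direct) for arg in expr.args)
    if isinstance(expr, BinOp):
        return bag_use_ok(expr.left) and bag_use_ok(expr.right)
    return True  # constants and group-key references never block the combiner


def combiner_eligible(generate_exprs):
    """All generated expressions must pass the bag-use check."""
    return all(bag_use_ok(expr) for expr in generate_exprs)


# The examples from the release note, written as trees.
sum_c2 = Func("SUM", (Bag("c2"),))
avg_c2 = Func("AVG", (Bag("c2"),))
avg_c3 = Func("AVG", (Bag("c3"),))
uses = [
    [BinOp("*", sum_c2, avg_c3)],       # SUM(A.c2) * AVG(A.c3)
    [BinOp("/", Const(1), sum_c2)],     # 1 / SUM(A.c2)
    [Func("EXP", (avg_c2,))],           # EXP(AVG(A.c2))
    [BinOp("+", Group(), sum_c2)],      # group + SUM(A.c2)
]
does_not_use = [
    [Bag("c2")],                        # bag projected directly
    [Func("EXP", (Bag("c2"),)), sum_c2],  # non-algebraic EXP takes the bag
]
```

Under these assumptions, every entry in `uses` passes `combiner_eligible` and every entry in `does_not_use` fails it, matching the two lists in the release note.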

      This patch also fixes PIG-490: foreach statements that access elements of the group key also use the combiner, for example:
      1) foreach B generate group.$0, group.$1, COUNT(A);
      2) foreach B generate group.c1, group.c2, COUNT(A);

      Description

      Currently Pig uses the combiner only when all of a, b, c, ... are algebraic (e.g. SUM, AVG) in a foreach:

      foreach X generate a,b,c,...

      It would be a performance improvement to use the combiner when a mix of algebraic and non-algebraic functions is used as well.

        Attachments

        1. PIG-750.1.patch
          67 kB
          Thejas M Nair


            People

            • Assignee: Thejas M Nair (thejas)
            • Reporter: Amir Youssefi (amirhyoussefi)
            • Votes: 1
            • Watchers: 2
