Pig / PIG-983

PERFORMANCE: multi-query optimization on multiple group bys following a join or cogroup

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: impl
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The current multi-query optimizer works well with Pig scripts like this one:

      data = LOAD 'input' AS (a:chararray, b:int, c:int);
      A = GROUP data BY b;
      B = GROUP data BY c;
      C = FOREACH A GENERATE group, COUNT(data);
      D = FOREACH B GENERATE group, SUM(data.b);
      STORE C INTO 'output1';
      STORE D INTO 'output2';
      

      In this case the original three Map-Reduce jobs are merged into one MR job by the optimizer.
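
      As a rough illustration of why one job can serve both group bys, the sketch below simulates a single map/shuffle/reduce pass in Python. It is a conceptual model only, not Pig's actual implementation: the sample rows, the branch tags (0 and 1), and the function names are all assumptions made for this example. Each record is emitted once per group-by branch with a tagged key, and the reducer applies the aggregate that belongs to the tag.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for 'input': tuples of (a, b, c).
rows = [("x", 1, 10), ("y", 1, 20), ("z", 2, 10)]

def map_phase(rows):
    """Emit each record once per group-by branch, tagged with a branch index."""
    for a, b, c in rows:
        yield (0, b), (a, b, c)   # branch 0: GROUP data BY b
        yield (1, c), (a, b, c)   # branch 1: GROUP data BY c

def shuffle(pairs):
    """Collect map output by its (branch, key) composite key, as one shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Demultiplex on the branch tag and apply that branch's aggregate."""
    output1, output2 = {}, {}
    for (branch, key), bag in groups.items():
        if branch == 0:
            output1[key] = len(bag)                 # COUNT(data)
        else:
            output2[key] = sum(r[1] for r in bag)   # SUM(data.b)
    return output1, output2

output1, output2 = reduce_phase(shuffle(map_phase(rows)))
print(output1)
print(output2)
```

      The input is read once and shuffled once; only the reducer needs to know which branch a key belongs to.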

      The current optimizer, however, does not reduce the number of MR jobs for scripts in which multiple group bys follow a join or a cogroup, such as this one:

      data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
      data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
      A = JOIN data1 BY a1, data2 BY a2;
      B = GROUP A BY data1::b1;
      C = GROUP A BY data2::c2;
      D = FOREACH B GENERATE group, COUNT(A);
      E = FOREACH C GENERATE group, SUM(A.data2::b2);
      STORE D INTO 'output1';
      STORE E INTO 'output2';                        
      

      Three MR jobs are still needed to run this script.

      The multi-query optimizer should handle scripts of this kind by merging the group bys, reducing the total number of MR jobs.
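
      One plausible shape for the merged plan is a join job followed by a single job that computes both aggregates. The Python sketch below models that shape only; the issue does not state the resulting job count, so the two-job split, the sample data, and the function names are assumptions for illustration and do not describe what the attached patch does.

```python
from collections import defaultdict

# Hypothetical stand-ins for 'input1' (a1, b1, c1) and 'input2' (a2, b2, c2).
data1 = [("k1", 1, 5), ("k2", 2, 6)]
data2 = [("k1", 7, 100), ("k1", 8, 200), ("k2", 9, 100)]

def join_job(left, right):
    """Assumed job 1: inner join on a1 == a2, producing six-field tuples."""
    by_key = defaultdict(list)
    for r in right:
        by_key[r[0]].append(r)
    return [l + r for l in left for r in by_key[l[0]]]

def merged_group_job(joined):
    """Assumed job 2: both group bys computed in one pass over the join output."""
    counts = defaultdict(int)   # GROUP A BY data1::b1 -> COUNT(A)
    sums = defaultdict(int)     # GROUP A BY data2::c2 -> SUM(A.data2::b2)
    for a1, b1, c1, a2, b2, c2 in joined:
        counts[b1] += 1
        sums[c2] += b2
    return dict(counts), dict(sums)

joined = join_job(data1, data2)
output1, output2 = merged_group_job(joined)
print(output1)
print(output2)
```

      Under these assumptions the join output is scanned once for both stores, instead of once per group by.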

      Attachments

        1. PIG-983.patch
          9 kB
          Richard Ding

          People

            Assignee: Richard Ding (rding)
            Reporter: Richard Ding (rding)