PIG-4066

An optimization for ROLLUP operation in Pig

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Patch Info:
      Patch Available

      Description

      This patch addresses a limitation of the current ROLLUP operator in Pig: most of the work is done in the Map phase of the underlying MapReduce job, which generates all possible intermediate keys that the reducers use to aggregate and produce the ROLLUP output. In our previous work, “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows for a different approach (in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed (Hybrid IRG), which exposes a parameter to balance parallelism (in the reducers) against communication cost.
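      The difference between the two approaches described above can be sketched in plain Python. This is a toy simulation, not the Pig implementation: the function names, the use of None as the "rolled-up" marker, and the sample rows are all assumptions made for illustration. It only shows that both approaches produce the same aggregates while the map step emits a different number of records.

```python
from collections import defaultdict

def rollup_map_side(rows):
    """Default-style ROLLUP: the map step emits one record per rollup level."""
    emitted = []
    for dims, value in rows:
        for level in range(len(dims), -1, -1):
            # Dropped dimensions are padded with None (hypothetical marker).
            key = dims[:level] + (None,) * (len(dims) - level)
            emitted.append((key, value))
    totals = defaultdict(int)
    for key, value in emitted:  # the reduce step only sums per key
        totals[key] += value
    return dict(totals), len(emitted)

def rollup_in_reducer(rows):
    """IRG-style ROLLUP: the map step emits one record per input row."""
    emitted = [(dims, value) for dims, value in rows]
    totals = defaultdict(int)
    for dims, value in emitted:  # the reduce step derives every coarser level
        for level in range(len(dims), -1, -1):
            key = dims[:level] + (None,) * (len(dims) - level)
            totals[key] += value
    return dict(totals), len(emitted)

# Hypothetical sample data: (region, state, city) -> sales
rows = [
    (("EU", "FR", "Paris"), 10),
    (("EU", "FR", "Nice"), 5),
    (("EU", "IT", "Rome"), 7),
]
map_side, map_side_records = rollup_map_side(rows)
in_reducer, irg_records = rollup_in_reducer(rows)
```

      In this sketch the map-side variant shuffles four records per input row (one per rollup level of a three-column key), while the in-reducer variant shuffles one. The trade-off is that in-reducer grouping leaves fewer independent keys to spread across reducers, which is the parallelism cost the description mentions.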
      This patch contains the following features:
      1. The new ROLLUP approach: IRG, Hybrid IRG.
      2. The PIVOT clause in CUBE operators.
      3. Test cases.
      The new syntax to use our ROLLUP approach:
      alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } ...]
      If a CUBE clause contains multiple ROLLUP operators, the last ROLLUP operator is executed with our approach (IRG, Hybrid IRG), while the preceding ROLLUP operators are executed with the default approach.
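      One way to picture how a pivot can trade communication cost for reducer parallelism is the toy Python simulation below. It is a sketch under assumptions, not the code in this patch: it assumes that rollup levels coarser than the pivot are expanded in the map step, that levels at or beyond the pivot are derived in the reduce step, and that records are partitioned on the first `pivot` columns. The function names, the None padding, and the sample rows are hypothetical, and the exact meaning of pivot_value in Pig may differ.

```python
from collections import defaultdict

def pad(dims, level):
    """Keep the first `level` columns and mark the rest as rolled up."""
    return dims[:level] + (None,) * (len(dims) - level)

def hybrid_rollup(rows, pivot):
    """Toy hybrid ROLLUP: returns (totals, shuffled records, partitions)."""
    emitted = []
    for dims, value in rows:
        d = len(dims)
        if not 0 <= pivot <= d:
            raise ValueError("pivot must be between 0 and the number of columns")
        emitted.append((d, dims, value))  # finest-grained record, always sent
        for level in range(pivot):        # coarse levels expanded map-side
            emitted.append((level, pad(dims, level), value))
    totals = defaultdict(int)
    for level, key, value in emitted:
        d = len(key)
        if level == d:
            # Reduce side derives every level from the pivot down to the finest.
            for lvl in range(pivot, d + 1):
                totals[pad(key, lvl)] += value
        else:
            totals[key] += value          # pre-expanded coarse level, just sum
    # Assumed partitioning: finest-grained records grouped by their first `pivot` columns.
    partitions = {dims[:pivot] for dims, _ in rows}
    return dict(totals), len(emitted), len(partitions)

# Hypothetical sample data: (region, state, city) -> sales
rows = [
    (("EU", "FR", "Paris"), 10),
    (("EU", "FR", "Nice"), 5),
    (("EU", "IT", "Rome"), 7),
    (("US", "CA", "LA"), 4),
]
results = {p: hybrid_rollup(rows, p) for p in range(4)}
```

      Under these assumptions every pivot yields the same aggregates. A pivot of 0 behaves like pure in-reducer grouping (fewest shuffled records, a single partition), and a pivot equal to the number of columns behaves like the default approach (most shuffled records, most partitions).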
      We have already run some experiments comparing our ROLLUP implementation with the current ROLLUP. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
      The patch can be reviewed here: https://reviews.apache.org/r/23804/

      1. UserGuide.pdf
        87 kB
        Quang-Nhat HOANG-XUAN
      2. TechnicalNotes.pdf
        182 kB
        Quang-Nhat HOANG-XUAN
      3. TechnicalNotes.2.pdf
        175 kB
        Quang-Nhat HOANG-XUAN
      4. PIG-4066-revert.patch
        147 kB
        Daniel Dai
      5. PIG-4066.patch
        156 kB
        Quang-Nhat HOANG-XUAN
      6. PIG-4066.5.patch
        147 kB
        Quang-Nhat HOANG-XUAN
      7. PIG-4066.4.patch
        146 kB
        Quang-Nhat HOANG-XUAN
      8. PIG-4066.3.patch
        147 kB
        Quang-Nhat HOANG-XUAN
      9. PIG-4066.2.patch
        145 kB
        Quang-Nhat HOANG-XUAN
      10. Current Rollup vs Our Rollup.jpg
        108 kB
        Quang-Nhat HOANG-XUAN

          Activity

          Quang-Nhat HOANG-XUAN added a comment -

          Daniel Dai, thank you so much.

          Daniel Dai added a comment -

          Cheolsoo Park, no problem, I will take care of it.

          Cheolsoo Park added a comment -

          Daniel Dai, sorry for the trouble and thanks for the clean-up.

          Daniel Dai added a comment -

          Patch reverted on the 0.15 branch and trunk. Attaching the patch used for the revert.

          Daniel Dai added a comment -

          Looking at the patch while trying to document it. The idea is good and simple; however, there are a couple of issues in the implementation:
          1. Some basic queries do not work, e.g.: "cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city) pivot 1;"
          2. Even if there is no "pivot" keyword, the implementation still uses the new Pivot code
          3. All scripts go through RollupHIIOptimizer; it is on by default. Both #2 and #3 make it impossible to ship this as just an experimental feature
          4. The logic of RollupHII should be wrapped into the new operator, not necessarily propagated to cogroup/UserFuncExpression, etc.
          5. There is a lot of redundant code that needs to be cleaned up
          6. Not a showstopper, but I would like to port it to Tez as well

          I have already done quite a bit of cleanup. Since it will touch the majority of the original patch, to make the commit history less confusing, I'd like to roll back the patch completely first and then redo it.

          Cheolsoo Park added a comment -

          Committed to trunk.

          Quang-Nhat HOANG-XUAN added a comment -

          Thank you so much for your time, Cheolsoo Park!
          I will open a JIRA soon and add the documentation!

          Cheolsoo Park added a comment -

          +1.

          I will commit this patch today. This optimization is disabled by default and only applicable to MR, so it shouldn't break anything. Nevertheless, I ran full unit tests and e2e tests, and both were clean.

          Quang-Nhat HOANG-XUAN, we should document this. Do you mind opening another JIRA to add the documentation? I think optimization-rules is the best place to put it.


            People

            • Assignee: Quang-Nhat HOANG-XUAN
            • Reporter: Quang-Nhat HOANG-XUAN
            • Votes: 0
            • Watchers: 7
