Pig / PIG-4066

An optimization for ROLLUP operation in Pig

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Patch Info:
      Patch Available

      Description

      This patch addresses a limitation of the current ROLLUP operator in Pig: most of the work is done in the Map phase of the underlying MapReduce job, which generates all possible intermediate keys that the reducers use to aggregate and produce the ROLLUP output. In our previous work, “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows for a different approach, in-reducer grouping (IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed, Hybrid IRG, which exposes a parameter to balance parallelism in the reducers against communication cost.
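      The trade-off described above can be illustrated with a small Python sketch. It is not the code from the patch: the function names are hypothetical, the aggregate is assumed to be a sum, and the way the pivot splits the levels (levels coarser than the pivot are emitted by the mapper, levels at or below the pivot are derived in the reducer from the full key) is an assumption made for illustration.

```python
from collections import defaultdict

def vanilla_map(record, dims):
    """Default-style sketch: emit one key per ROLLUP level (every prefix)."""
    key, value = record[:dims], record[dims]
    return [(key[:level], value) for level in range(dims, -1, -1)]

def hybrid_map(record, dims, pivot):
    """Hybrid sketch (assumed rule): emit the full key once, plus one key
    for each level coarser than the pivot."""
    key, value = record[:dims], record[dims]
    pairs = [(key, value)]
    pairs += [(key[:level], value) for level in range(pivot - 1, -1, -1)]
    return pairs

def vanilla_reduce(pairs):
    """Sum the values of identical keys."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def hybrid_reduce(pairs, dims, pivot):
    """Sum per key; full keys also feed every level from the pivot down,
    so those levels are grouped in the reducer rather than in the mapper."""
    totals = defaultdict(int)
    for key, value in pairs:
        if len(key) == dims:
            for level in range(pivot, dims + 1):
                totals[key[:level]] += value
        else:
            totals[key] += value
    return dict(totals)

# Three (year, month, day, amount) records, rolled up over three dimensions.
records = [("2014", "07", "01", 5), ("2014", "07", "02", 3), ("2014", "08", "01", 2)]
vanilla_pairs = [p for r in records for p in vanilla_map(r, 3)]
hybrid_pairs = [p for r in records for p in hybrid_map(r, 3, 1)]
```

      In this sketch both strategies produce the same aggregates, but the hybrid mapper emits fewer intermediate pairs. A lower pivot reduces the data shuffled between the phases; a higher pivot emits more keys and so leaves more room for parallelism in the reducers.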
      This patch contains the following features:
      1. The new ROLLUP approach: IRG, Hybrid IRG.
      2. The PIVOT clause in CUBE operators.
      3. Test cases.
      The new syntax to use our ROLLUP approach:
      alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } ...]
      If a CUBE clause contains multiple ROLLUP operators, the last ROLLUP operator is executed with our approach (IRG, Hybrid IRG), while the preceding ROLLUP operators are executed with the default approach.
      We have already run experiments comparing our ROLLUP implementation with the current ROLLUP. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
      The patch can be reviewed here: https://reviews.apache.org/r/23804/
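      The selection rule for a clause with several operators can be sketched as follows. This is only an illustration of the rule as stated here, not the planner logic from the patch; `plan_rollups` and the strategy labels are hypothetical names.

```python
def plan_rollups(operations):
    """Label each operator in a CUBE clause with its execution strategy.

    Only the last ROLLUP is marked for the new approach; every other
    operator (earlier ROLLUPs and any CUBE) keeps the default one.
    """
    rollup_positions = [i for i, op in enumerate(operations) if op == "ROLLUP"]
    last = rollup_positions[-1] if rollup_positions else None
    return [(op, "irg" if i == last else "default")
            for i, op in enumerate(operations)]

plan = plan_rollups(["ROLLUP", "CUBE", "ROLLUP"])
```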

      1. Current Rollup vs Our Rollup.jpg
        108 kB
        Quang-Nhat HOANG-XUAN
      2. PIG-4066.2.patch
        145 kB
        Quang-Nhat HOANG-XUAN
      3. PIG-4066.3.patch
        147 kB
        Quang-Nhat HOANG-XUAN
      4. PIG-4066.4.patch
        146 kB
        Quang-Nhat HOANG-XUAN
      5. PIG-4066.5.patch
        147 kB
        Quang-Nhat HOANG-XUAN
      6. PIG-4066.patch
        156 kB
        Quang-Nhat HOANG-XUAN
      7. TechnicalNotes.2.pdf
        175 kB
        Quang-Nhat HOANG-XUAN
      8. TechnicalNotes.pdf
        182 kB
        Quang-Nhat HOANG-XUAN
      9. UserGuide.pdf
        87 kB
        Quang-Nhat HOANG-XUAN

        Activity

        Cheolsoo Park made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.15.0 [ 12328760 ]
        Resolution Fixed [ 1 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.5.patch [ 12685605 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.4.patch [ 12681875 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.3.patch [ 12679181 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.2.patch [ 12668085 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.2.patch [ 12668072 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment TechnicalNotes.2.pdf [ 12668073 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.2.patch [ 12668072 ]
        Quang-Nhat HOANG-XUAN made changes -
        Patch Info Patch Available [ 10042 ]
        Quang-Nhat HOANG-XUAN made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Quang-Nhat HOANG-XUAN made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Quang-Nhat HOANG-XUAN made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Cheolsoo Park made changes -
        Assignee Quang-Nhat HOANG-XUAN [ hxquangnhat ]
        Quang-Nhat HOANG-XUAN made changes -
        Labels perfomance hybrid-irg optimization rollup
        Quang-Nhat HOANG-XUAN made changes -
        Patch Info Patch Available [ 10042 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment Current Rollup vs Our Rollup.jpg [ 12657098 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment UserGuide.pdf [ 12657097 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment TechnicalNotes.pdf [ 12657096 ]
        Quang-Nhat HOANG-XUAN made changes -
        Attachment PIG-4066.patch [ 12657095 ]
        Quang-Nhat HOANG-XUAN created issue -

          People

          • Assignee:
            Quang-Nhat HOANG-XUAN
            Reporter:
            Quang-Nhat HOANG-XUAN
          • Votes:
            0
            Watchers:
            6
