Pig
  1. Pig
  2. PIG-2167 CUBE operation in Pig
  3. PIG-2765

Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator

    Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None

      Description

      Implement RollupDimensions UDF which performs aggregation from most detailed level of dimensions to the most general level (grand total) in hierarchical order. Provide support for ROLLUP clause in CUBE operator.

      1. PIG-2765.1.patch
        44 kB
        Prasanth J
      2. PIG-2765.2.git.patch
        89 kB
        Prasanth J
      3. PIG-2765.2.patch
        88 kB
        Prasanth J

        Activity

        Hide
        Prasanth J added a comment -

        This patch contains the following features
        1) RollupDimensions UDF
        2) Support for ROLLUP clause in CUBE operator
        3) Testcases for both
        4) Removes "dimensions::" namespace from the output schema of cube operator

        The syntax for the CUBE operator is now

        alias = CUBE rel BY { CUBE | ROLLUP } col_ref [, { CUBE | ROLLUP } col_ref ...]
        

        Example:

        out = CUBE inp BY CUBE(a,b), ROLLUP(c,d);
        

        the above code will generate following combinations of aggregations for each input tuples
        (a,b,c,d)
        (a,NULL,c,d)
        (NULL,b,c,d)
        (NULL,NULL,c,d)
        (a,b,c,NULL)
        (a,NULL,c,NULL)
        (NULL,b,c,NULL)
        (NULL,NULL,c,NULL)
        (a,b,NULL,NULL)
        (a,NULL,NULL,NULL)
        (NULL,b,NULL,NULL)
        (NULL,NULL,NULL,NULL)

        Schema for "out" will be

        out: {group: (a: bytearray,b: bytearray,c: bytearray,d: bytearray),cube: {(a: bytearray,b: bytearray,c: bytearray,d: bytearray)}}
        

        NOTE: NULL value handling is not available in this patch. Patch for legitimate NULL value handling is available at https://issues.apache.org/jira/browse/PIG-2726

        Show
        Prasanth J added a comment - This patch contains the following features 1) RollupDimensions UDF 2) Support for ROLLUP clause in CUBE operator 3) Testcases for both 4) Removes "dimensions::" namespace from the output schema of cube operator The syntax for the CUBE operator is now alias = CUBE rel BY { CUBE | ROLLUP } col_ref [, { CUBE | ROLLUP } col_ref ...] Example: out = CUBE inp BY CUBE(a,b), ROLLUP(c,d); the above code will generate following combinations of aggregations for each input tuples (a,b,c,d) (a,NULL,c,d) (NULL,b,c,d) (NULL,NULL,c,d) (a,b,c,NULL) (a,NULL,c,NULL) (NULL,b,c,NULL) (NULL,NULL,c,NULL) (a,b,NULL,NULL) (a,NULL,NULL,NULL) (NULL,b,NULL,NULL) (NULL,NULL,NULL,NULL) Schema for "out" will be out: {group: (a: bytearray,b: bytearray,c: bytearray,d: bytearray),cube: {(a: bytearray,b: bytearray,c: bytearray,d: bytearray)}} NOTE: NULL value handling is not available in this patch. Patch for legitimate NULL value handling is available at https://issues.apache.org/jira/browse/PIG-2726
        Hide
        Prasanth J added a comment -

        Review board request for this patch: https://reviews.apache.org/r/5521/

        Show
        Prasanth J added a comment - Review board request for this patch: https://reviews.apache.org/r/5521/
        Hide
        Prasanth J added a comment -

        Since the second version of this patch is generated using git, I have created a new review board request https://reviews.apache.org/r/5733/

        Please let me know in case of any issues.

        Show
        Prasanth J added a comment - Since the second version of this patch is generated using git, I have created a new review board request https://reviews.apache.org/r/5733/ Please let me know in case of any issues.
        Hide
        Dmitriy V. Ryaboy added a comment -

        I'll review tonight. FYI you can update old reviews, then I can just look at the bits that changed

        Show
        Dmitriy V. Ryaboy added a comment - I'll review tonight. FYI you can update old reviews, then I can just look at the bits that changed
        Hide
        Prasanth J added a comment -

        Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff (PIG-2765 depends on PIG-2726) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.

        Show
        Prasanth J added a comment - Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff ( PIG-2765 depends on PIG-2726 ) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.
        Hide
        Prasanth J added a comment -

        Updated the old review with a new patch. Please use the old review https://reviews.apache.org/r/5521/ to look at the changed bits based on your review comment. Ignore my previous comment too

        Show
        Prasanth J added a comment - Updated the old review with a new patch. Please use the old review https://reviews.apache.org/r/5521/ to look at the changed bits based on your review comment. Ignore my previous comment too
        Hide
        Prasanth J added a comment -

        Added svn patch with changes based on Dmitriy's code review comments.

        Show
        Prasanth J added a comment - Added svn patch with changes based on Dmitriy's code review comments.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Ok this looks good to commit. Please add release notes to this ticket (or to documentation, via a different ticket) that explain what Rollup does, how to use them, and what the syntax looks like.

        Show
        Dmitriy V. Ryaboy added a comment - Ok this looks good to commit. Please add release notes to this ticket (or to documentation, via a different ticket) that explain what Rollup does, how to use them, and what the syntax looks like.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Committed to trunk (0.11)

        Show
        Dmitriy V. Ryaboy added a comment - Committed to trunk (0.11)
        Hide
        Prasanth J added a comment -

        Thanks Dmitriy for committing the code. I will add release notes to this ticket. I will also create another ticket for documentation purpose.

        Show
        Prasanth J added a comment - Thanks Dmitriy for committing the code. I will add release notes to this ticket. I will also create another ticket for documentation purpose.
        Hide
        Prasanth J added a comment -

        RELEASE NOTES

        Introduction
        Data cube
        In Online Analytical Processing Systems (OLAP), a data cube is a way of organizing and visualizing multi-dimensional data to get useful insights over some measure of interest. Measure is a term for numerical facts that are analyzed against various dimensions of data. Based on the type of aggregations performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic (TOP-K, MEDIAN etc.). The following articles on wikipedia provides good illustrations for data cubes
        http://en.wikipedia.org/wiki/Online_analytical_processing
        http://en.wikipedia.org/wiki/OLAP_cube

        Cube operation
        For a specified set of dimensions, cube operations computes aggregates for all possbile combinations of group by dimensions. For example, CUBE(product, location) will generate

        {(product,location), (product,null), (null,location), (null,null)}

        combinations for all tuples in product and location dimensions, where (null, null) represents the grand total.
        The number of group by combinations generated by cube for n dimensions will be 2^n.

        Rollup operation
        For a specified set of dimensions, rollup operations computes multiple levels of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, city) will generate

        {(region, state, city), (region, state, null), (region, null, null), (null, null, null)}

        combinations for all tuples in region, state and city dimensions. It computes aggregates from the most-details level (region, state, city) to the grand total (null, null, null). Rollup is useful when there is hierarchical ordering on the dimensions.
        The number of group by combinations generated by rollup for n dimensions will be n+1.

        Cubing in Pig
        Pig supports CUBE operator which can be used to perform cube or rollup operations. The patches (PIG-2710, PIG-2765) adds syntactic sugar to the existing builtin CubeDimensions and RollupDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. Following example illustrates the CUBE operator usage:

        Syntax

        outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];

        outalias - The name of the output relation
        inalias - The name of the input relation
        CUBE, BY, ROLLUP - keywords
        expression - projections (dimensions) of the relation.
        PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).

        Basic usage of CUBE operation

        salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long); 
        cubedinp = CUBE salesinp BY CUBE(product,year);
        result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
        

        Sample output
        For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube operation will output
        (car,2012,4000)
        (car,,4000)
        (,2012,4000)
        (,,4000)

        Output schema

        grunt> describe cubedinp;
        cubedinp: {group: (product: chararray,year: int),cube: {(product: chararray,year: int,region: chararray,state: chararray,city: chararray,sales: long)}}
        

        Note the second column, ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ along with other unused dimensions in load statement are pushed down so that it can be referenced later while computing aggregates on the measure, like in this case SUM(cube.sales).

        Basic usage of ROLLUP operation

        salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long); 
        rolledup = CUBE salesinp BY ROLLUP(region,state,city);
        result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
        

        Sample output
        For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with rollup operation will output
        (midwest,ohio,columbus,4000)
        (midwest,ohio,,4000)
        (midwest,,,4000)
        (,,,4000)

        Output schema

        grunt> describe rolledup;
        rolledup: {group: (region: chararray,state: chararray,city: chararray),cube: {(region: chararray,state: chararray,city: chararray,product: chararray,year: int,sales: long)}}
        

        Basic usage of CUBE and ROLLUP operation combined
        If CUBE and ROLLUP operations are used together, the output groups will be the cross product of all groups generated by cube and rollup operation. If there are m dimensions in cube operations and n dimensions in rollup operation then overall number of combinations will be (2^m) * (n+1).

        salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long); 
        cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city);
        result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
        

        Sample output
        For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube and rollup operation will output
        (car,2012,midwest,ohio,columbus,4000)
        (car,2012,midwest,ohio,,4000)
        (car,2012,midwest,,,4000)
        (car,2012,,,,4000)
        (car,,midwest,ohio,columbus,4000)
        (car,,midwest,ohio,,4000)
        (car,,midwest,,,4000)
        (car,,,,,4000)
        (,2012,midwest,ohio,columbus,4000)
        (,2012,midwest,ohio,,4000)
        (,2012,midwest,,,4000)
        (,2012,,,,4000)
        (,,midwest,ohio,columbus,4000)
        (,,midwest,ohio,,4000)
        (,,midwest,,,4000)
        (,,,,,4000)

        Output schema

        grunt> describe cubed_and_rolled;
        cubed_and_rolled: {group: (product: chararray,year: int,region: chararray,state: chararray,city: chararray),cube: {(product: chararray,year: int,region: chararray,state: chararray,city: chararray,sales: long)}}
        

        Handling null values in dimensions
        Since null values are used to represent subtotals in cube and rollup operation, in order to differentiate the legitimate null values that already exists as dimension values, CUBE operator converts any null values in dimensions to "unknown" value before performing cube or rollup operation. For example, for CUBE(product,location) with a sample tuple (car,null) the output will be

        {(car,unknown), (car,null), (null,unknown), (null,null)}

        .

        Show
        Prasanth J added a comment - RELEASE NOTES Introduction Data cube In Online Analytical Processing Systems (OLAP), a data cube is a way of organizing and visualizing multi-dimensional data to get useful insights over some measure of interest. Measure is a term for numerical facts that are analyzed against various dimensions of data. Based on the type of aggregations performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic (TOP-K, MEDIAN etc.). The following articles on wikipedia provides good illustrations for data cubes http://en.wikipedia.org/wiki/Online_analytical_processing http://en.wikipedia.org/wiki/OLAP_cube Cube operation For a specified set of dimensions, cube operations computes aggregates for all possbile combinations of group by dimensions. For example, CUBE(product, location) will generate {(product,location), (product,null), (null,location), (null,null)} combinations for all tuples in product and location dimensions, where (null, null) represents the grand total. The number of group by combinations generated by cube for n dimensions will be 2^n. Rollup operation For a specified set of dimensions, rollup operations computes multiple levels of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, city) will generate {(region, state, city), (region, state, null), (region, null, null), (null, null, null)} combinations for all tuples in region, state and city dimensions. It computes aggregates from the most-details level (region, state, city) to the grand total (null, null, null). Rollup is useful when there is hierarchical ordering on the dimensions. The number of group by combinations generated by rollup for n dimensions will be n+1. Cubing in Pig Pig supports CUBE operator which can be used to perform cube or rollup operations. The patches ( PIG-2710 , PIG-2765 ) adds syntactic sugar to the existing builtin CubeDimensions and RollupDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. Following example illustrates the CUBE operator usage: Syntax outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n]; outalias - The name of the output relation inalias - The name of the input relation CUBE, BY, ROLLUP - keywords expression - projections (dimensions) of the relation. PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Basic usage of CUBE operation salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); cubedinp = CUBE salesinp BY CUBE(product,year); result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube operation will output (car,2012,4000) (car,,4000) (,2012,4000) (,,4000) Output schema grunt> describe cubedinp; cubedinp: {group: (product: chararray,year: int ),cube: {(product: chararray,year: int ,region: chararray,state: chararray,city: chararray,sales: long )}} Note the second column, ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ along with other unused dimensions in load statement are pushed down so that it can be referenced later while computing aggregates on the measure, like in this case SUM(cube.sales). Basic usage of ROLLUP operation salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); rolledup = CUBE salesinp BY ROLLUP(region,state,city); result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with rollup operation will output (midwest,ohio,columbus,4000) (midwest,ohio,,4000) (midwest,,,4000) (,,,4000) Output schema grunt> describe rolledup; rolledup: {group: (region: chararray,state: chararray,city: chararray),cube: {(region: chararray,state: chararray,city: chararray,product: chararray,year: int ,sales: long )}} Basic usage of CUBE and ROLLUP operation combined If CUBE and ROLLUP operations are used together, the output groups will be the cross product of all groups generated by cube and rollup operation. If there are m dimensions in cube operations and n dimensions in rollup operation then overall number of combinations will be (2^m) * (n+1). salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city); result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube and rollup operation will output (car,2012,midwest,ohio,columbus,4000) (car,2012,midwest,ohio,,4000) (car,2012,midwest,,,4000) (car,2012,,,,4000) (car,,midwest,ohio,columbus,4000) (car,,midwest,ohio,,4000) (car,,midwest,,,4000) (car,,,,,4000) (,2012,midwest,ohio,columbus,4000) (,2012,midwest,ohio,,4000) (,2012,midwest,,,4000) (,2012,,,,4000) (,,midwest,ohio,columbus,4000) (,,midwest,ohio,,4000) (,,midwest,,,4000) (,,,,,4000) Output schema grunt> describe cubed_and_rolled; cubed_and_rolled: {group: (product: chararray,year: int ,region: chararray,state: chararray,city: chararray),cube: {(product: chararray,year: int ,region: chararray,state: chararray,city: chararray,sales: long )}} Handling null values in dimensions Since null values are used to represent subtotals in cube and rollup operation, in order to differentiate the legitimate null values that already exists as dimension values, CUBE operator converts any null values in dimensions to "unknown" value before performing cube or rollup operation. For example, for CUBE(product,location) with a sample tuple (car,null) the output will be {(car,unknown), (car,null), (null,unknown), (null,null)} .

          People

          • Assignee:
            Prasanth J
            Reporter:
            Prasanth J
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development