# Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator

## Details

• Status: Closed
• Priority: Major
• Resolution: Fixed
• Affects Version/s: None
• Fix Version/s:
• Component/s: None
• Labels:
None

## Description

Implement RollupDimensions UDF which performs aggregation from most detailed level of dimensions to the most general level (grand total) in hierarchical order. Provide support for ROLLUP clause in CUBE operator.

## Attachments

1. PIG-2765.1.patch
44 kB
Prasanth Jayachandran
2. PIG-2765.2.git.patch
89 kB
Prasanth Jayachandran
3. PIG-2765.2.patch
88 kB
Prasanth Jayachandran

## Activity

Hide
Prasanth Jayachandran added a comment -

This patch contains the following features
1) RollupDimensions UDF
2) Support for ROLLUP clause in CUBE operator
3) Testcases for both
4) Removes "dimensions::" namespace from the output schema of cube operator

The syntax for the CUBE operator is now

```alias = CUBE rel BY { CUBE | ROLLUP } col_ref [, { CUBE | ROLLUP } col_ref ...]
```

Example:

```out = CUBE inp BY CUBE(a,b), ROLLUP(c,d);
```

the above code will generate following combinations of aggregations for each input tuples
(a,b,c,d)
(a,NULL,c,d)
(NULL,b,c,d)
(NULL,NULL,c,d)
(a,b,c,NULL)
(a,NULL,c,NULL)
(NULL,b,c,NULL)
(NULL,NULL,c,NULL)
(a,b,NULL,NULL)
(a,NULL,NULL,NULL)
(NULL,b,NULL,NULL)
(NULL,NULL,NULL,NULL)

Schema for "out" will be

```out: {group: (a: bytearray,b: bytearray,c: bytearray,d: bytearray),cube: {(a: bytearray,b: bytearray,c: bytearray,d: bytearray)}}
```

NOTE: NULL value handling is not available in this patch. Patch for legitimate NULL value handling is available at https://issues.apache.org/jira/browse/PIG-2726

Show
Prasanth Jayachandran added a comment - This patch contains the following features 1) RollupDimensions UDF 2) Support for ROLLUP clause in CUBE operator 3) Testcases for both 4) Removes "dimensions::" namespace from the output schema of cube operator The syntax for the CUBE operator is now alias = CUBE rel BY { CUBE | ROLLUP } col_ref [, { CUBE | ROLLUP } col_ref ...] Example: out = CUBE inp BY CUBE(a,b), ROLLUP(c,d); the above code will generate following combinations of aggregations for each input tuples (a,b,c,d) (a,NULL,c,d) (NULL,b,c,d) (NULL,NULL,c,d) (a,b,c,NULL) (a,NULL,c,NULL) (NULL,b,c,NULL) (NULL,NULL,c,NULL) (a,b,NULL,NULL) (a,NULL,NULL,NULL) (NULL,b,NULL,NULL) (NULL,NULL,NULL,NULL) Schema for "out" will be out: {group: (a: bytearray,b: bytearray,c: bytearray,d: bytearray),cube: {(a: bytearray,b: bytearray,c: bytearray,d: bytearray)}} NOTE: NULL value handling is not available in this patch. Patch for legitimate NULL value handling is available at https://issues.apache.org/jira/browse/PIG-2726
Hide
Prasanth Jayachandran added a comment -

Review board request for this patch: https://reviews.apache.org/r/5521/

Show
Prasanth Jayachandran added a comment - Review board request for this patch: https://reviews.apache.org/r/5521/
Hide
Prasanth Jayachandran added a comment -

Since the second version of this patch is generated using git, I have created a new review board request https://reviews.apache.org/r/5733/

Please let me know in case of any issues.

Show
Prasanth Jayachandran added a comment - Since the second version of this patch is generated using git, I have created a new review board request https://reviews.apache.org/r/5733/ Please let me know in case of any issues.
Hide
Dmitriy V. Ryaboy added a comment -

I'll review tonight. FYI you can update old reviews, then I can just look at the bits that changed

Show
Dmitriy V. Ryaboy added a comment - I'll review tonight. FYI you can update old reviews, then I can just look at the bits that changed
Hide
Prasanth Jayachandran added a comment -

Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff (PIG-2765 depends on PIG-2726) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.

Show
Prasanth Jayachandran added a comment - Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff ( PIG-2765 depends on PIG-2726 ) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.
Hide
Prasanth Jayachandran added a comment -

Updated the old review with a new patch. Please use the old review https://reviews.apache.org/r/5521/ to look at the changed bits based on your review comment. Ignore my previous comment too

Show
Prasanth Jayachandran added a comment - Updated the old review with a new patch. Please use the old review https://reviews.apache.org/r/5521/ to look at the changed bits based on your review comment. Ignore my previous comment too
Hide
Prasanth Jayachandran added a comment -

Show
Prasanth Jayachandran added a comment - Added svn patch with changes based on Dmitriy's code review comments.
Hide
Dmitriy V. Ryaboy added a comment -

Ok this looks good to commit. Please add release notes to this ticket (or to documentation, via a different ticket) that explain what Rollup does, how to use them, and what the syntax looks like.

Show
Dmitriy V. Ryaboy added a comment - Ok this looks good to commit. Please add release notes to this ticket (or to documentation, via a different ticket) that explain what Rollup does, how to use them, and what the syntax looks like.
Hide
Dmitriy V. Ryaboy added a comment -

Committed to trunk (0.11)

Show
Dmitriy V. Ryaboy added a comment - Committed to trunk (0.11)
Hide
Prasanth Jayachandran added a comment -

Thanks Dmitriy for committing the code. I will add release notes to this ticket. I will also create another ticket for documentation purpose.

Show
Prasanth Jayachandran added a comment - Thanks Dmitriy for committing the code. I will add release notes to this ticket. I will also create another ticket for documentation purpose.
Hide
Prasanth Jayachandran added a comment -

RELEASE NOTES

Introduction
Data cube
In Online Analytical Processing Systems (OLAP), a data cube is a way of organizing and visualizing multi-dimensional data to get useful insights over some measure of interest. Measure is a term for numerical facts that are analyzed against various dimensions of data. Based on the type of aggregations performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic (TOP-K, MEDIAN etc.). The following articles on wikipedia provides good illustrations for data cubes
http://en.wikipedia.org/wiki/Online_analytical_processing
http://en.wikipedia.org/wiki/OLAP_cube

Cube operation
For a specified set of dimensions, cube operations computes aggregates for all possbile combinations of group by dimensions. For example, CUBE(product, location) will generate

{(product,location), (product,null), (null,location), (null,null)}

combinations for all tuples in product and location dimensions, where (null, null) represents the grand total.
The number of group by combinations generated by cube for n dimensions will be 2^n.

Rollup operation
For a specified set of dimensions, rollup operations computes multiple levels of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, city) will generate

{(region, state, city), (region, state, null), (region, null, null), (null, null, null)}

combinations for all tuples in region, state and city dimensions. It computes aggregates from the most-details level (region, state, city) to the grand total (null, null, null). Rollup is useful when there is hierarchical ordering on the dimensions.
The number of group by combinations generated by rollup for n dimensions will be n+1.

Cubing in Pig
Pig supports CUBE operator which can be used to perform cube or rollup operations. The patches (PIG-2710, PIG-2765) adds syntactic sugar to the existing builtin CubeDimensions and RollupDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. Following example illustrates the CUBE operator usage:

Syntax

`outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];`

outalias - The name of the output relation
inalias - The name of the input relation
CUBE, BY, ROLLUP - keywords
expression - projections (dimensions) of the relation.
PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).

Basic usage of CUBE operation

```salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long);
cubedinp = CUBE salesinp BY CUBE(product,year);
result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
```

Sample output
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube operation will output
(car,2012,4000)
(car,,4000)
(,2012,4000)
(,,4000)

Output schema

```grunt> describe cubedinp;
cubedinp: {group: (product: chararray,year: int),cube: {(product: chararray,year: int,region: chararray,state: chararray,city: chararray,sales: long)}}
```

Note the second column, ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ along with other unused dimensions in load statement are pushed down so that it can be referenced later while computing aggregates on the measure, like in this case SUM(cube.sales).

Basic usage of ROLLUP operation

```salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long);
rolledup = CUBE salesinp BY ROLLUP(region,state,city);
result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
```

Sample output
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with rollup operation will output
(midwest,ohio,columbus,4000)
(midwest,ohio,,4000)
(midwest,,,4000)
(,,,4000)

Output schema

```grunt> describe rolledup;
rolledup: {group: (region: chararray,state: chararray,city: chararray),cube: {(region: chararray,state: chararray,city: chararray,product: chararray,year: int,sales: long)}}
```

Basic usage of CUBE and ROLLUP operation combined
If CUBE and ROLLUP operations are used together, the output groups will be the cross product of all groups generated by cube and rollup operation. If there are m dimensions in cube operations and n dimensions in rollup operation then overall number of combinations will be (2^m) * (n+1).

```salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year:int, region:chararray, state:chararray, city:chararray, sales:long);
cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city);
result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as totalsales;
```

Sample output
For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube and rollup operation will output
(car,2012,midwest,ohio,columbus,4000)
(car,2012,midwest,ohio,,4000)
(car,2012,midwest,,,4000)
(car,2012,,,,4000)
(car,,midwest,ohio,columbus,4000)
(car,,midwest,ohio,,4000)
(car,,midwest,,,4000)
(car,,,,,4000)
(,2012,midwest,ohio,columbus,4000)
(,2012,midwest,ohio,,4000)
(,2012,midwest,,,4000)
(,2012,,,,4000)
(,,midwest,ohio,columbus,4000)
(,,midwest,ohio,,4000)
(,,midwest,,,4000)
(,,,,,4000)

Output schema

```grunt> describe cubed_and_rolled;
cubed_and_rolled: {group: (product: chararray,year: int,region: chararray,state: chararray,city: chararray),cube: {(product: chararray,year: int,region: chararray,state: chararray,city: chararray,sales: long)}}
```

Handling null values in dimensions
Since null values are used to represent subtotals in cube and rollup operation, in order to differentiate the legitimate null values that already exists as dimension values, CUBE operator converts any null values in dimensions to "unknown" value before performing cube or rollup operation. For example, for CUBE(product,location) with a sample tuple (car,null) the output will be

{(car,unknown), (car,null), (null,unknown), (null,null)}

.

Show
Prasanth Jayachandran added a comment - RELEASE NOTES Introduction Data cube In Online Analytical Processing Systems (OLAP), a data cube is a way of organizing and visualizing multi-dimensional data to get useful insights over some measure of interest. Measure is a term for numerical facts that are analyzed against various dimensions of data. Based on the type of aggregations performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic (TOP-K, MEDIAN etc.). The following articles on wikipedia provides good illustrations for data cubes http://en.wikipedia.org/wiki/Online_analytical_processing http://en.wikipedia.org/wiki/OLAP_cube Cube operation For a specified set of dimensions, cube operations computes aggregates for all possbile combinations of group by dimensions. For example, CUBE(product, location) will generate {(product,location), (product,null), (null,location), (null,null)} combinations for all tuples in product and location dimensions, where (null, null) represents the grand total. The number of group by combinations generated by cube for n dimensions will be 2^n. Rollup operation For a specified set of dimensions, rollup operations computes multiple levels of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, city) will generate {(region, state, city), (region, state, null), (region, null, null), (null, null, null)} combinations for all tuples in region, state and city dimensions. It computes aggregates from the most-details level (region, state, city) to the grand total (null, null, null). Rollup is useful when there is hierarchical ordering on the dimensions. The number of group by combinations generated by rollup for n dimensions will be n+1. Cubing in Pig Pig supports CUBE operator which can be used to perform cube or rollup operations. The patches ( PIG-2710 , PIG-2765 ) adds syntactic sugar to the existing builtin CubeDimensions and RollupDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. Following example illustrates the CUBE operator usage: Syntax outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n]; outalias - The name of the output relation inalias - The name of the input relation CUBE, BY, ROLLUP - keywords expression - projections (dimensions) of the relation. PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task). Basic usage of CUBE operation salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); cubedinp = CUBE salesinp BY CUBE(product,year); result = FOREACH cubedinp GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube operation will output (car,2012,4000) (car,,4000) (,2012,4000) (,,4000) Output schema grunt> describe cubedinp; cubedinp: {group: (product: chararray,year: int ),cube: {(product: chararray,year: int ,region: chararray,state: chararray,city: chararray,sales: long )}} Note the second column, ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ along with other unused dimensions in load statement are pushed down so that it can be referenced later while computing aggregates on the measure, like in this case SUM(cube.sales). Basic usage of ROLLUP operation salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); rolledup = CUBE salesinp BY ROLLUP(region,state,city); result = FOREACH rolledup GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with rollup operation will output (midwest,ohio,columbus,4000) (midwest,ohio,,4000) (midwest,,,4000) (,,,4000) Output schema grunt> describe rolledup; rolledup: {group: (region: chararray,state: chararray,city: chararray),cube: {(region: chararray,state: chararray,city: chararray,product: chararray,year: int ,sales: long )}} Basic usage of CUBE and ROLLUP operation combined If CUBE and ROLLUP operations are used together, the output groups will be the cross product of all groups generated by cube and rollup operation. If there are m dimensions in cube operations and n dimensions in rollup operation then overall number of combinations will be (2^m) * (n+1). salesinp = LOAD '/pig/data/salesdata' USING PigStorage(',') as (product:chararray, year: int , region:chararray, state:chararray, city:chararray, sales: long ); cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city); result = FOREACH cubed_and_rolled GENERATE FLATTEN(group), SUM(cube.sales) as totalsales; Sample output For a sample input tuple (car, 2012, midwest, ohio, columbus, 4000), the above query with cube and rollup operation will output (car,2012,midwest,ohio,columbus,4000) (car,2012,midwest,ohio,,4000) (car,2012,midwest,,,4000) (car,2012,,,,4000) (car,,midwest,ohio,columbus,4000) (car,,midwest,ohio,,4000) (car,,midwest,,,4000) (car,,,,,4000) (,2012,midwest,ohio,columbus,4000) (,2012,midwest,ohio,,4000) (,2012,midwest,,,4000) (,2012,,,,4000) (,,midwest,ohio,columbus,4000) (,,midwest,ohio,,4000) (,,midwest,,,4000) (,,,,,4000) Output schema grunt> describe cubed_and_rolled; cubed_and_rolled: {group: (product: chararray,year: int ,region: chararray,state: chararray,city: chararray),cube: {(product: chararray,year: int ,region: chararray,state: chararray,city: chararray,sales: long )}} Handling null values in dimensions Since null values are used to represent subtotals in cube and rollup operation, in order to differentiate the legitimate null values that already exists as dimension values, CUBE operator converts any null values in dimensions to "unknown" value before performing cube or rollup operation. For example, for CUBE(product,location) with a sample tuple (car,null) the output will be {(car,unknown), (car,null), (null,unknown), (null,null)} .

## People

• Assignee:
Prasanth Jayachandran
Reporter:
Prasanth Jayachandran