# CUBE operation in Pig

## Details

• Type: New Feature
• Status: Open
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: None
• Fix Version/s: None
• Component/s: None
• Labels:

## Description

Computing aggregates over a cube of several dimensions is a common operation in data warehousing.

The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" – which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".

A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk

We can start with the naive solution which only works for algebraic measures, and work up from there.

## Attachments

1. PIG-2167.4.patch
62 kB
Prasanth Jayachandran
2. PIG-2167.3.patch
59 kB
Prasanth Jayachandran
3. PIG-2167.2.patch
58 kB
Prasanth Jayachandran
4. PIG-2167.1.patch
57 kB
Prasanth Jayachandran
5. Pig-Cubing-Performance.png
20 kB
Prasanth Jayachandran

 1 CubeDimensions UDF Closed 2 Implement Naive CUBE operator Closed 3 Handling legitimate NULL values Closed 4 Implementing RollupDimensions UDF and adding ROLLUP clause in CUBE operator Closed 5 MR-Cube implementation (Distributed cubing for holistic measures) Open

## Activity

Hide
Dmitriy V. Ryaboy added a comment -

I believe there is value to providing the naive solution and improving on it later, rather than trying to build the optimal plan from the get-go.

Initial (naive) implementation plan:

Add an optional "WITH CUBE" clause to the group operator.

In LogicalPlanBuilder, if "WITH CUBE" is present, insert operators equivalent to the following above the group operator:

```relation = foreach relation generate
FLATTEN(CubeDimensions(dim1, dim2, dim3))
as (dim1, dim2, dim3),
other_fields;
```

It may be desirable in some cases to group by a superset of dimensions one wants to cube on: group by dim1, dim2, dim3 with cube on (dim1, dim2). If we want to support that use case, we simply need to know to call the UDF on (dim1, dim2) and push dim3 into the other_fields list.

Note also that there's a bit of a problem if null values are legitimate values for the dimensions, as we use null to indicate "all". The UDF provided in PIG-2168 allows one to use custom strings instead of null for the "all" marker. We can optionally support this in the grammar, as well.

Show
Dmitriy V. Ryaboy added a comment - I believe there is value to providing the naive solution and improving on it later, rather than trying to build the optimal plan from the get-go. Initial (naive) implementation plan: Add an optional "WITH CUBE" clause to the group operator. In LogicalPlanBuilder, if "WITH CUBE" is present, insert operators equivalent to the following above the group operator: relation = foreach relation generate FLATTEN(CubeDimensions(dim1, dim2, dim3)) as (dim1, dim2, dim3), other_fields; It may be desirable in some cases to group by a superset of dimensions one wants to cube on: group by dim1, dim2, dim3 with cube on (dim1, dim2). If we want to support that use case, we simply need to know to call the UDF on (dim1, dim2) and push dim3 into the other_fields list. Note also that there's a bit of a problem if null values are legitimate values for the dimensions, as we use null to indicate "all". The UDF provided in PIG-2168 allows one to use custom strings instead of null for the "all" marker. We can optionally support this in the grammar, as well.
Hide
Prasanth Jayachandran added a comment -

Hello everyone

I am Prasanth Jayachandran, graduate student at The Ohio State University. I am working with Prof. Arnab Nandi for providing CUBE operator support in Pig. I am submitting the initial version of the CUBE operator implementation(naive version of cube materialization). As this is my first patch submission I am really excited about it and am hoping to continue my contribution for Apache Pig. Please review the attached patch and provide feedback from improvising it.

Following contents explains the design decision and some initial performance numbers (experiments performed on single node pseudo-distributed hadoop setup).
Pig syntax for Cubing
CUBE rel BY (a,b,c);

SQL/Oracle syntax for Cubing
GROUP BY CUBE (a,b,c);

CUBE operator internals
The CUBE operator injects the logical plan for following operators
x = FOREACH rel GENERATE FLATTEN(CubeDimensions(a,b,c));
y = GROUP x by (a,b,c);

What is the output schema of CUBE operator?
{group: tuple(a,b,c), cube: bag{(dimensions::a,dimensions::b,dimensions::c)}}

Why syntactically different from SQL/Oracle?

• Easier to implement as it does not modify or break the existing GROUP BY operator implementation
• CUBE operator might require separate flags for the following
• Switching between BUC and STAR cubing (future optimization)
• HAVING clause for monotonic operations
• ROLLUP/DRILLDOWN operations
• Hint the location of partially computed CUBE
• user specified inputs (example: algebraic attribute for converting the holistic measure to partially algebraic measure can be specified by user)
• Some operations applicable in GROUP operator are not applicable for CUBE
• Constant expression evaluation
• Duplicate column projection
• Follows Pig language design principle of procedural simplicity

Corner case handling
Constant expressions can be provided in GROUP BY operator. Constant expressions support has been removed from CUBE BY operator grammar. If constant expressions are used with CUBE BY, FrontEndException is thrown.
Duplicate column projection is supported in GROUP BY. Duplicates columns will be eliminated while generating logical plan. Current implementation ignores duplicates dimensions in CUBE BY. This can also be modified to throw exception if user repeats a dimension more than once.
If cube dimensions are a subset of columns in input schema then the remaining columns in the input schema will be pushed to the “cube” bag.
For example:
inp = LOAD ‘/pig/data/input’ AS (a,b,c,d);
x = CUBE inp BY (a,b);
schema of x will be

{group:tuple(a,b), cube:bag(dimensions::a,dimensions::b,c,d)}

Performance
Apache Pig Test Environment
OS: Ubuntu 11.04 running as guest OS in Virtual Box
Memory: 8GB
HDD: 100GB
Mode: Single node pseudo-distributed mode setup running Hadoop-0.20.2.
SQL Server 2008 R2 Test Environment
OS: Windows 7
Memory: 16GB
HDD: 500GB

Acknowledgements
Professor Arnab Nandi, Department of Computer Science and Engineering, The Ohio State University, Columbus for guidance and assistance throughout the course of this initial implementation.
Chaitanya Solarpurikar, Graduate Student, Department of Computer Science and Engineering, The Ohio State University, Columbus for setting up SQL server test environment and running performance comparison experiments.

Show
Hide
Dmitriy V. Ryaboy added a comment -

I wouldn't put too much weight on timing local mode. You are likely spending most of the time in JobControl's sleep(5000). We'll see about posting our patch for that.

Show
Dmitriy V. Ryaboy added a comment - I wouldn't put too much weight on timing local mode. You are likely spending most of the time in JobControl's sleep(5000). We'll see about posting our patch for that.
Hide
Dmitriy V. Ryaboy added a comment -

Assigning to self as per apache guidelines, as I'd like to mentor this. Putting my application in right now.

Show
Dmitriy V. Ryaboy added a comment - Assigning to self as per apache guidelines, as I'd like to mentor this. Putting my application in right now.
Hide
Prasanth Jayachandran added a comment -

Hello Everyone
I have added the proposal for cubing project at GSoC Proposal

Show
Hide
Prasanth Jayachandran added a comment -

The URL didn't redirect correctly. Pasting it again

Show
Hide
Dmitriy V. Ryaboy added a comment -

Prasanth, could you post your initial patch (the one already attached to this jira) to review board?

Show
Dmitriy V. Ryaboy added a comment - Prasanth, could you post your initial patch (the one already attached to this jira) to review board?
Hide
Prasanth Jayachandran added a comment -

Posted in review board https://reviews.apache.org/r/4670/

Let me know in case if missed some details.

Show
Prasanth Jayachandran added a comment - Posted in review board https://reviews.apache.org/r/4670/ Let me know in case if missed some details.
Hide
Prasanth Jayachandran added a comment -

Submitting patch(version 2) with updates based on code review comments by Dmitriy.

Show
Prasanth Jayachandran added a comment - Submitting patch(version 2) with updates based on code review comments by Dmitriy.
Hide
Dmitriy V. Ryaboy added a comment -

Prasanth, seems like you didn't update the RB?

Makes it hard to see what changed between v.1 and v.2

Show
Dmitriy V. Ryaboy added a comment - Prasanth, seems like you didn't update the RB? Makes it hard to see what changed between v.1 and v.2
Hide
Prasanth Jayachandran added a comment -

Dmitriy,

I have updated v2 of the patch to the RB just after replying to your review comments.

Here is the link for the diff against the baseline
https://reviews.apache.org/r/4670/diff/2/

Here is the link for diff against Patch-1

Please let me know if I missed out something while updating.

Show
Prasanth Jayachandran added a comment - Dmitriy, I have updated v2 of the patch to the RB just after replying to your review comments. Here is the link for the diff against the baseline https://reviews.apache.org/r/4670/diff/2/ Here is the link for diff against Patch-1 https://reviews.apache.org/r/4670/diff/1-2/#index_header Please let me know if I missed out something while updating.
Hide
Dmitriy V. Ryaboy added a comment -

Quite right, I was looking at a different RB. Sorry bout that. Reviewing .

Show
Dmitriy V. Ryaboy added a comment - Quite right, I was looking at a different RB. Sorry bout that. Reviewing .
Hide

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/
-----------------------------------------------------------

(Updated 2012-04-30 04:05:03.466616)

Review request for pig and Dmitriy Ryaboy.

Changes
-------

Updated JIRA issue field to reflect just the issue number.

Summary
-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167

https://issues.apache.org/jira/browse/PIG-2167

Diffs

Testing
-------

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit

Thanks,

Prasanth_J

Show
jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4670/ ----------------------------------------------------------- (Updated 2012-04-30 04:05:03.466616) Review request for pig and Dmitriy Ryaboy. Changes ------- Updated JIRA issue field to reflect just the issue number. Summary ------- This is a review board for https://issues.apache.org/jira/browse/PIG-2167 This addresses bug PIG-2167 . https://issues.apache.org/jira/browse/PIG-2167 Diffs http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION Diff: https://reviews.apache.org/r/4670/diff Testing ------- Unit tests: All passed Pre-commit tests: All passed ant clean test-commit Thanks, Prasanth_J
Hide

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> I see, you cut this out in LogicalPlanGenerator so you assume this won't get called before the plan is generated. That'll probably work for now. We should definitely fix this in the next stage.

Yeah. Will fix that in the next stage (implementing physical operator).

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> you can merge this try/catch with the one above.

Done.

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> try Util.createFile(String[] data)

>

> Even better, you can now use the new MockLoader from PIG-2650 (save a bit more time and complexity on the tests..)

Updated. Now all test cases use MockLoader.

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> return here

Updated.

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> Assert.fail("Expected to throw an exception when duplicate dimensions are detected!");

>

> (otherwise your test passes even if this doesn't work!)

Updated.

On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote:

>

> ExecMode.LOCAL iirc

PigServer constructor accepts string ('local') or ExecType (ExecType.LOCAL). If its a string it internally converts to corresponding ExecType.

• Prasanth_J

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7373
-----------------------------------------------------------

On 2012-04-30 04:05:03, Prasanth_J wrote:

-----------------------------------------------------------

This is an automatically generated e-mail. To reply, visit:

https://reviews.apache.org/r/4670/

-----------------------------------------------------------

(Updated 2012-04-30 04:05:03)

Review request for pig and Dmitriy Ryaboy.

Summary

-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167

https://issues.apache.org/jira/browse/PIG-2167

Diffs

-----

Testing

-------

Unit tests: All passed

Pre-commit tests: All passed

ant clean test-commit

Thanks,

Prasanth_J

Show
jiraposter@reviews.apache.org added a comment - On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java , line 74 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100633#file100633line74 > > > I see, you cut this out in LogicalPlanGenerator so you assume this won't get called before the plan is generated. That'll probably work for now. We should definitely fix this in the next stage. Yeah. Will fix that in the next stage (implementing physical operator). On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java , line 561 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100639#file100639line561 > > > you can merge this try/catch with the one above. Done. On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java , line 58 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line58 > > > try Util.createFile(String[] data) > > Even better, you can now use the new MockLoader from PIG-2650 (save a bit more time and complexity on the tests..) Updated. Now all test cases use MockLoader. On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java , line 302 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line302 > > > return here Updated. On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java , line 318 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line318 > > > Assert.fail("Expected to throw an exception when duplicate dimensions are detected!"); > > (otherwise your test passes even if this doesn't work!) Updated. On 2012-04-30 00:54:08, Dmitriy Ryaboy wrote: > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java , line 74 > < https://reviews.apache.org/r/4670/diff/1-2/?file=100648#file100648line74 > > > ExecMode.LOCAL iirc PigServer constructor accepts string ('local') or ExecType (ExecType.LOCAL). If its a string it internally converts to corresponding ExecType. Prasanth_J ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4670/#review7373 ----------------------------------------------------------- On 2012-04-30 04:05:03, Prasanth_J wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4670/ ----------------------------------------------------------- (Updated 2012-04-30 04:05:03) Review request for pig and Dmitriy Ryaboy. Summary ------- This is a review board for https://issues.apache.org/jira/browse/PIG-2167 This addresses bug PIG-2167 . https://issues.apache.org/jira/browse/PIG-2167 Diffs ----- http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1325115 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION Diff: https://reviews.apache.org/r/4670/diff Testing ------- Unit tests: All passed Pre-commit tests: All passed ant clean test-commit Thanks, Prasanth_J
Hide
Prasanth Jayachandran added a comment -

I am not able to attach the patch(created using git diff) to review board. So attaching it here.

Show
Prasanth Jayachandran added a comment - I am not able to attach the patch(created using git diff) to review board. So attaching it here.
Hide

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7484
-----------------------------------------------------------

This review took a little time as I got interrupted in the middle.
Great job overall.
Let me know what you think.

All those for loops could be factored into a private visitAll(Collection<LogicalExpressionPlan>)

same comment.
You can factor out the for loops in those methods.

chain all exceptions

same

do you know if there's a way to avoid duplication across .g files?

you can make this a javadoc comment

this is not affected to anything. Is it intended?

same comment. I guess it is intended, but it looks strange. If the constructor has side effect, maybe there should be another way

add the name of the duplicate in the error message

you van now use mock.Storage() for those tests.
see PIG-2650

Use the message part of the assert statement to ensure a good error message when it fails.

Tuple t = it.next()
assertTrue(expected+" contains "+t, expected.contains(t))

• Julien

On 2012-04-30 04:05:03, Prasanth_J wrote:

-----------------------------------------------------------

This is an automatically generated e-mail. To reply, visit:

https://reviews.apache.org/r/4670/

-----------------------------------------------------------

(Updated 2012-04-30 04:05:03)

Review request for pig and Dmitriy Ryaboy.

Summary

-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167

https://issues.apache.org/jira/browse/PIG-2167

Diffs

-----

Testing

-------

Unit tests: All passed

Pre-commit tests: All passed

ant clean test-commit

Thanks,

Prasanth_J

Show
Hide

On 2012-05-05 00:09:48, Julien Le Dem wrote:

> This review took a little time as I got interrupted in the middle.

> Great job overall.

> Let me know what you think.

Thanks for the review julien. I made patch v3 based on Dmitriy's review comments but was not able to upload in RB because git diff's are compatible with RB. I have incorporated most of your review comments in this patch (v4).

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

The MultiMap class under org.apache.pig.impl.util package doesn't seem to support iteration using entrySet.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> All those for loops could be factored into a private visitAll(Collection<LogicalExpressionPlan>)

Good catch. Added new private method and refactored all occurrences of for loop to use this method.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> same comment.

> You can factor out the for loops in those methods.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> chain all exceptions

Modified.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> you can make this a javadoc comment

is javadoc generated for private methods as well? Since this is a private method I kept it as a simple comment.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> add the name of the duplicate in the error message

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> you van now use mock.Storage() for those tests.

> see PIG-2650

Updated from patch v3 onwards.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> Use the message part of the assert statement to ensure a good error message when it fails.

>

> Tuple t = it.next()

> assertTrue(expected+" contains "+t, expected.contains(t))

Updated.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> same comment. I guess it is intended, but it looks strange. If the constructor has side effect, maybe there should be another way

This inserts a project expression to logical expression plan (lEplan in this case). I found this way in other files too. I could not find a better way to attach a project expression to a logical expression plan. Do you have any other thoughts/ways for doing this?

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

Updated.

On 2012-05-05 00:09:48, Julien Le Dem wrote:

>

> do you know if there's a way to avoid duplication across .g files?

Since all these grammar files serves different purposes, I am not sure if it is possible to avoid duplicates.

• Prasanth_J

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/#review7484
-----------------------------------------------------------

On 2012-04-30 04:05:03, Prasanth_J wrote:

-----------------------------------------------------------

This is an automatically generated e-mail. To reply, visit:

https://reviews.apache.org/r/4670/

-----------------------------------------------------------

(Updated 2012-04-30 04:05:03)

Review request for pig and Dmitriy Ryaboy.

Summary

-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167

https://issues.apache.org/jira/browse/PIG-2167

Diffs

-----

Testing

-------

Unit tests: All passed

Pre-commit tests: All passed

ant clean test-commit

Thanks,

Prasanth_J

Show
Hide

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4670/
-----------------------------------------------------------

(Updated 2012-05-06 00:13:16.285229)

Review request for pig and Dmitriy Ryaboy.

Changes
-------

Incorporated Dmitriy's and Julien's review comments.

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit

Summary
-------

This is a review board for https://issues.apache.org/jira/browse/PIG-2167

https://issues.apache.org/jira/browse/PIG-2167

Diffs (updated)

Testing
-------

Unit tests: All passed

Pre-commit tests: All passed
ant clean test-commit

Thanks,

Prasanth_J

Show
jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4670/ ----------------------------------------------------------- (Updated 2012-05-06 00:13:16.285229) Review request for pig and Dmitriy Ryaboy. Changes ------- Incorporated Dmitriy's and Julien's review comments. Unit tests: All passed Pre-commit tests: All passed ant clean test-commit Summary ------- This is a review board for https://issues.apache.org/jira/browse/PIG-2167 This addresses bug PIG-2167 . https://issues.apache.org/jira/browse/PIG-2167 Diffs (updated) http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLogicalPlanGenerator.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestParser.pig 1334534 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryLexer.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestQueryParser.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestCubeOperator.java PRE-CREATION http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstPrinter.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AstValidator.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryParser.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/parser/TestLexer.pig 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java PRE-CREATION http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/visitor/ProjectStarExpander.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/AliasMasker.g 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/SchemaResetter.java 1334534 http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/optimizer/AllExpressionVisitor.java 1334534 Diff: https://reviews.apache.org/r/4670/diff Testing ------- Unit tests: All passed Pre-commit tests: All passed ant clean test-commit Thanks, Prasanth_J
Hide
Prasanth Jayachandran added a comment -

Modifications:
1) Testing uses mock.Storage()
2) Chaining exceptions
3) Refactored iterative visits to a separate private method in logical optimizers

Show
Prasanth Jayachandran added a comment - Uploading patch with Julien's review comments incorporated. Modifications: 1) Testing uses mock.Storage() 2) Chaining exceptions 3) Refactored iterative visits to a separate private method in logical optimizers
Hide
Prasanth Jayachandran added a comment -

Hello everyone

I will be working with Dmitriy on this issue for GSoC 2012 which is about to start next week (May 21). Before getting started with the project I would like to hear from the community about the syntax for CUBE (and related) operators. As a part of this project I also have plans for implementing hierarchy based grouping, partial grouping and monotonicity based pruning using HAVING clause.

Following are the list of tasks that require modifications to the grammar file and for each of them I have included SQL/Oracle syntax, the proposed syntax for pig and the corresponding output groups. I would like to know feedback for each of the suggested syntax and finalize the syntax for all of them.

1) CUBE operation (all combinations of grouping)
SQL/Oracle syntax

`GROUP BY CUBE (a,b,c);`

Current syntax in Pig

`alias = CUBE rel BY (a,b,c);`

Output groups:

{(a,b,c), (a,b), (b,c), (a,c), (a), (b), (c), ()}

2) ROLLUP operation (hierarchy based grouping)
SQL/Oracle syntax

`GROUP BY ROLLUP (a,b,c);`

Proposed syntax for Pig:

`alias = CUBE rel BY ROLLUP(a,b,c);`

Output groups:

{(a,b,c), (a,b), (a), ()}

3) GROUPING SETS operation (partial grouping)
SQL/Oracle syntax

`GROUP BY GROUPING SETS ((a,b),(b),());`

Proposed syntax for Pig:

`alias = CUBE rel BY GROUPING SETS((a,b),(b),());`

Output groups:

{(a,b), (b), ()}

4) HAVING clause (for pruning the groups that does not satisfy the specified condition)
SQL/Oracle syntax:

```GROUP BY CUBE (x,y)
HAVING SUM(z)>100```

Proposed syntax for Pig:

`alias = CUBE rel BY (a,b,c) HAVING SUM(d)>100;`

Output groups:
Only the groups that satisfy the specified condition appears in the output

The primary reasons for having a separate CUBE operator are

• Ease of implementation without modification to existing GROUP BY operator implementation
• Ability to have a separate Physical Operator (POCube)
• Cube specific optimizations can be applied to the physical operator
• Some operations applicable to GROUP operator are not applicable for CUBE
• Constant expression evaluation
• Duplicate column projection

References:

Show
Prasanth Jayachandran added a comment - Hello everyone I will be working with Dmitriy on this issue for GSoC 2012 which is about to start next week (May 21). Before getting started with the project I would like to hear from the community about the syntax for CUBE (and related) operators. As a part of this project I also have plans for implementing hierarchy based grouping, partial grouping and monotonicity based pruning using HAVING clause. Following are the list of tasks that require modifications to the grammar file and for each of them I have included SQL/Oracle syntax, the proposed syntax for pig and the corresponding output groups. I would like to know feedback for each of the suggested syntax and finalize the syntax for all of them. 1) CUBE operation (all combinations of grouping) SQL/Oracle syntax GROUP BY CUBE (a,b,c); Current syntax in Pig alias = CUBE rel BY (a,b,c); Output groups: {(a,b,c), (a,b), (b,c), (a,c), (a), (b), (c), ()} 2) ROLLUP operation (hierarchy based grouping) SQL/Oracle syntax GROUP BY ROLLUP (a,b,c); Proposed syntax for Pig: alias = CUBE rel BY ROLLUP(a,b,c); Output groups: {(a,b,c), (a,b), (a), ()} 3) GROUPING SETS operation (partial grouping) SQL/Oracle syntax GROUP BY GROUPING SETS ((a,b),(b),()); Proposed syntax for Pig: alias = CUBE rel BY GROUPING SETS((a,b),(b),()); Output groups: {(a,b), (b), ()} 4) HAVING clause (for pruning the groups that does not satisfy the specified condition) SQL/Oracle syntax: GROUP BY CUBE (x,y) HAVING SUM(z)>100 Proposed syntax for Pig: alias = CUBE rel BY (a,b,c) HAVING SUM(d)>100; Output groups: Only the groups that satisfy the specified condition appears in the output The primary reasons for having a separate CUBE operator are Ease of implementation without modification to existing GROUP BY operator implementation Ability to have a separate Physical Operator (POCube) Cube specific optimizations can be applied to the physical operator Some operations applicable to GROUP operator are not applicable for CUBE Constant expression evaluation Duplicate column projection Please let me know your feedback/suggestions for the operator syntax. References: Oracle documentation for CUBE http://docs.oracle.com/cd/B19306_01/server.102/b14223/aggreg.htm#i1012749 SQL server documentation for CUBE http://msdn.microsoft.com/en-us/library/ms177673.aspx Simple CUBE/ROLLUP operations in Oracle and SQL http://sqlfiddle.com/#!3/3bba2/11 http://sqlfiddle.com/#!4/32c81/26
Hide
Dmitriy V. Ryaboy added a comment -

All looks good except for HAVING. The reason SQL has a HAVING clause is that it doesn't have a way to express post-group filtering without using subselects. In Pig Latin, we do – it's just a FILTER clause that follows your grouping clause. So the same effect can be achieved:

```alias = CUBE rel BY (a,b,c);
filtered_alias = FILTER alias BY SUM(rel.d) > 10;
```

Note that this assumes the schema of alias is (group, rel) where "rel" is a complex tuple representing a matching row from the original relation (the same way the GROUP operator currently works).

D

Show
Dmitriy V. Ryaboy added a comment - All looks good except for HAVING. The reason SQL has a HAVING clause is that it doesn't have a way to express post-group filtering without using subselects. In Pig Latin, we do – it's just a FILTER clause that follows your grouping clause. So the same effect can be achieved: alias = CUBE rel BY (a,b,c); filtered_alias = FILTER alias BY SUM(rel.d) > 10; Note that this assumes the schema of alias is (group, rel) where "rel" is a complex tuple representing a matching row from the original relation (the same way the GROUP operator currently works). D
Hide
Dmitriy V. Ryaboy added a comment -

Naive Cube implementation looks good. +1, will commit.

Let's create subtickets for incremental improvements as you have outlined them in the GSoC proposal, and continue discussion in those. Might as well create a subticket for the naive one (otherwise I have nothing to close out ).

Show
Dmitriy V. Ryaboy added a comment - Naive Cube implementation looks good. +1, will commit. Let's create subtickets for incremental improvements as you have outlined them in the GSoC proposal, and continue discussion in those. Might as well create a subticket for the naive one (otherwise I have nothing to close out ).
Hide
Russell Jurney added a comment -

Playing with this now, applied it to 0.10. Thanks!

Show
Russell Jurney added a comment - Playing with this now, applied it to 0.10. Thanks!
Hide
Prasanth Jayachandran added a comment -

Russell.. Try using the patch in JIRA-2170. I uploaded a new one yesterday which fixes a exception with illustrate operator..

Show
Prasanth Jayachandran added a comment - Russell.. Try using the patch in JIRA-2170. I uploaded a new one yesterday which fixes a exception with illustrate operator..
Hide
Russell Jurney added a comment -

Thanks prasanth, but which JIRA do you mean? PIG-2170 hasn't been updated since august?

Show
Russell Jurney added a comment - Thanks prasanth, but which JIRA do you mean? PIG-2170 hasn't been updated since august?
Hide
Prasanth Jayachandran added a comment -

Oops. My bad. Its PIG-2710 actually.

Show
Prasanth Jayachandran added a comment - Oops. My bad. Its PIG-2710 actually.
Hide

Hi Prasanth,
I'm interesting in data cube and map reduce. So I am reading the source code of CUBE operation and the associated paper.
In the paper <Distributed Cube Materialization on Holistic Measures>, it mentions "batch areas", have you implement it here?
If "batch areas" has been implemented, could you please tell me where is it? I can't find it. Thank you so much.

Show
zhchbin added a comment - Hi Prasanth, I'm interesting in data cube and map reduce. So I am reading the source code of CUBE operation and the associated paper. In the paper <Distributed Cube Materialization on Holistic Measures>, it mentions "batch areas", have you implement it here? If "batch areas" has been implemented, could you please tell me where is it? I can't find it. Thank you so much.
Hide
Prasanth Jayachandran added a comment -

zhchbin Thanks for your interest. The cube work in Pig had many parts to it, PIG-2167 adds the syntax for CUBE and ROLLUP operation. It also rewrites the logical plan to insert the cubing and rollup udfs. PIG-2167 is just a syntactic sugar over the UDFs. The main idea of the MRCube algorithm is to handle holistic measures (PIG-2831). PIG-2831 patch is way too outdated and is not reviewed yet. So there might be many rough edges. PIG-2831 does the following
1) Identify holistic measure and insert logical cube/rollup operators with some additional information
2) Identify algebraic attributes (partitioning attribute)
3) Inserts sampling job into MRPlan (added new sampling algorithm that reads N records randomly without reading the entire file)
5) Inserts post-processing job to aggregate the partitioned results

All the above steps are required for value-partitioning algorithm for data distribution. It does not implement the batch-area identification (distributing the computation) as mentioned in the paper. It might be worthwhile to start off with value partitioning followed by batch areas and other optimizations like BUC etc. I would be more than happy to help you with any of these steps in my free time.

Show

## People

• Assignee:
Prasanth Jayachandran
Reporter:
Dmitriy V. Ryaboy