[PIG-4066] An optimization for ROLLUP operation in Pig - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: None
Labels:

Patch Info: Patch Available

Description

This patch addresses a limitation of the current ROLLUP operator in Pig: most of the work is done in the Map phase of the underlying MapReduce job, which generates every intermediate key that the reducers use to aggregate and produce the ROLLUP output. Based on our previous work, “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows a different approach, in-reducer grouping (IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed, Hybrid IRG, which exposes a parameter that balances parallelism (in the reducers) against communication cost.
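To make the trade-off described above concrete, here is a minimal Python sketch of one possible reading of the idea. It is not the code in the patch. The function names, the sample records, the SUM aggregate, and the interpretation of the pivot (levels coarser than the pivot are expanded in the map step, levels at or below it are derived in the reduce step) are all assumptions made for illustration.

```python
from collections import defaultdict


def default_map(record, d):
    """Default approach (as described above): emit one key per ROLLUP level."""
    dims, value = record[:d], record[d]
    for level in range(d, -1, -1):
        yield dims[:level], value


def hybrid_map(record, d, pivot):
    """Assumed hybrid behaviour: emit the full key once, plus one key for
    each level coarser than the pivot."""
    dims, value = record[:d], record[d]
    yield dims, value
    for level in range(pivot - 1, -1, -1):
        yield dims[:level], value


def irg_reduce(sorted_pairs, d, pivot):
    """In-reducer grouping sketch: one pass over sorted full keys, closing a
    group at each level from pivot to d whenever its prefix changes."""
    out, open_groups = {}, {}
    for key, value in sorted_pairs:
        for level in range(pivot, d + 1):
            prefix = key[:level]
            current = open_groups.get(level)
            if current is None or current[0] != prefix:
                if current is not None:
                    out[current[0]] = current[1]  # prefix changed: flush
                current = open_groups[level] = [prefix, 0]
            current[1] += value
    for prefix, total in open_groups.values():
        out[prefix] = total  # flush the groups still open at end of input
    return out


def rollup_default(records, d):
    """Return (totals per key, number of map output records)."""
    totals, emitted = defaultdict(int), 0
    for rec in records:
        for key, value in default_map(rec, d):
            totals[key] += value
            emitted += 1
    return dict(totals), emitted


def rollup_hybrid(records, d, pivot):
    """Return (totals per key, number of map output records)."""
    totals, full_keys, emitted = defaultdict(int), [], 0
    for rec in records:
        for key, value in hybrid_map(rec, d, pivot):
            emitted += 1
            if len(key) == d:
                full_keys.append((key, value))
            else:
                totals[key] += value  # coarse levels: aggregated as before
    # A global sort stands in for the per-reducer sort of a real job.
    totals.update(irg_reduce(sorted(full_keys), d, pivot))
    return dict(totals), emitted


# Hypothetical input: (year, month, day, amount)
sample = [(2014, 7, 22, 5), (2014, 7, 23, 3), (2014, 9, 11, 4), (2015, 5, 22, 1)]

for p in range(4):
    result, emitted = rollup_hybrid(sample, 3, p)
    print(p, emitted, result == rollup_default(sample, 3)[0])
```

In this sketch a smaller pivot means fewer map output records (with pivot 1, each record yields two outputs instead of four) but fewer distinct key prefixes over which the reduce work could be spread, which is the balance between communication cost and parallelism that the description refers to.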
This patch contains the following features:
1. The new ROLLUP approach: IRG, Hybrid IRG.
2. The PIVOT clause in CUBE operators.
3. Test cases.
The new syntax to use our ROLLUP approach:
alias = CUBE rel BY { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value] } ...]
If a CUBE clause contains multiple ROLLUP operators, only the last ROLLUP is executed with our approach (IRG, Hybrid IRG); the preceding ROLLUP operators are executed with the default approach.
We have already run experiments comparing our ROLLUP implementation with the current one. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
The patch can be reviewed here: https://reviews.apache.org/r/23804/

Attachments

PIG-4066.patch, 22/Jul/14 09:14, 156 kB, Quang-Nhat HOANG-XUAN
TechnicalNotes.pdf, 22/Jul/14 09:14, 182 kB, Quang-Nhat HOANG-XUAN
UserGuide.pdf, 22/Jul/14 09:15, 87 kB, Quang-Nhat HOANG-XUAN
Current Rollup vs Our Rollup.jpg, 22/Jul/14 09:15, 108 kB, Quang-Nhat HOANG-XUAN
TechnicalNotes.2.pdf, 11/Sep/14 13:32, 175 kB, Quang-Nhat HOANG-XUAN
PIG-4066.2.patch, 11/Sep/14 14:27, 145 kB, Quang-Nhat HOANG-XUAN
PIG-4066.3.patch, 04/Nov/14 08:35, 147 kB, Quang-Nhat HOANG-XUAN
PIG-4066.4.patch, 17/Nov/14 10:50, 146 kB, Quang-Nhat HOANG-XUAN
PIG-4066.5.patch, 07/Dec/14 12:34, 147 kB, Quang-Nhat HOANG-XUAN
PIG-4066-revert.patch, 22/May/15 00:07, 147 kB, Daniel Dai

Issue Links

is related to: IMPALA-7204 Add support for GROUP BY ROLLUP, CUBE and GROUPING SETS (Open)

relates to: PIG-4566 Reimplement PIG-4066: An optimization for ROLLUP operation in Pig (Open)

People

Assignee: Quang-Nhat HOANG-XUAN

Reporter: Quang-Nhat HOANG-XUAN

Votes: 0

Watchers: 6

Dates

Created: 22/Jul/14 09:11

Updated: 23/Jun/18 16:44

Resolved: 12/Dec/14 20:17