[HIVE-6120] Add GroupBy optimization to eliminate un-needed partial distinct aggregations - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Query Processor
Labels:
None

Description

In most cases, partial distinct aggregation is not needed in map-side groupby. The exception is that with sorted bucketized tables partial distinct aggregation can be done by the mappers in some scenarios, as what is done by GroupByOptimzer.

Currently, partial distinct aggregation is done in the map-side GroupBy and then shuffle of the partial result is done in the following ReduceSink operator, in cases where they are not needed. This wastes CPU cycles, memory and network bandwidth.

This optimization eliminates un-needed partial distinct aggregations, which improves performance and reduces memory usage.

For example,
EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;

Before optimization:

              Group By Operator
                aggregations:
                      expr: count(DISTINCT value)
                bucketGroup: false
                keys:
                      expr: key
                      type: int
                      expr: value
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint

After optimization:

              Group By Operator
                bucketGroup: false
                keys:
                      expr: key
                      type: int
                      expr: value
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: int
                  tag: -1

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HIVE-6120.1.patch
29/Dec/13 13:42
81 kB
Sun Rui
HIVE-6120.2.patch
30/Dec/13 01:41
113 kB
Sun Rui

Activity

People

Assignee:: Sun Rui

Reporter:: Sun Rui

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 29/Dec/13 13:38

Updated:: 30/Dec/13 03:57