[TAJO-601] Improve distinct aggregation query processing - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0
Component/s: Planner/Optimizer
Labels: None

Description

Currently, distinct aggregation queries are executed in two stages:

First stage: tuples are only shuffled by hashing the grouping keys; no aggregation is performed.
Second stage: the shuffled tuples are sorted and then aggregated with sort aggregation.

This approach executes queries that include distinct aggregation functions in only two stages. However, because nothing is aggregated before the shuffle, it produces a large amount of intermediate data during the shuffle phase.

This kind of query can be rewritten as two nested aggregation queries:

original query

SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col from rel1 group by grp1, grp2;

rewritten query

SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
  SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3
) tmp1 group by grp1, grp2;

I expect this rewrite to significantly reduce the intermediate data volume and query response time in most cases, because the inner query collapses duplicate (grp1, grp2, grp3) tuples before the outer aggregation runs.
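The equivalence of the two query forms can be checked on a small data set. The sketch below is only an illustration: it uses SQLite through Python's standard sqlite3 module as a stand-in for Tajo, the sample rows are made up, and an ORDER BY is added to both queries purely so the results can be compared directly. It also counts how many rows the inner query emits, which approximates the intermediate data that would be passed to the outer aggregation.

```python
import sqlite3

# Original form from the description (ORDER BY added only for comparison).
ORIGINAL = """
SELECT grp1, grp2, count(*) AS total, count(DISTINCT grp3) AS distinct_col
FROM rel1 GROUP BY grp1, grp2
ORDER BY grp1, grp2
"""

# Rewritten form: the inner query pre-aggregates per (grp1, grp2, grp3).
REWRITTEN = """
SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col FROM (
  SELECT grp1, grp2, grp3, count(*) AS cnt FROM rel1 GROUP BY grp1, grp2, grp3
) tmp1 GROUP BY grp1, grp2
ORDER BY grp1, grp2
"""


def build_sample_db():
    """Create an in-memory table rel1 with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 TEXT, grp3 TEXT)")
    rows = [
        ("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
        ("a", "y", "p"),
        ("b", "x", None), ("b", "x", "r"), ("b", "x", "r"),
    ]
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)
    return conn


conn = build_sample_db()
original_result = conn.execute(ORIGINAL).fetchall()
rewritten_result = conn.execute(REWRITTEN).fetchall()

# Rows entering the aggregation without pre-aggregation vs. rows emitted by the inner query.
raw_rows = conn.execute("SELECT count(*) FROM rel1").fetchone()[0]
pre_aggregated_rows = conn.execute(
    "SELECT count(*) FROM "
    "(SELECT grp1 FROM rel1 GROUP BY grp1, grp2, grp3) AS inner_groups"
).fetchone()[0]

print(original_result == rewritten_result)
print(raw_rows, pre_aggregated_rows)
```

The sample includes a NULL in grp3 on purpose: count(distinct grp3) and the outer count(grp3) both skip NULL, while sum(cnt) still counts that row toward total, so the two forms agree. How much the row count shrinks depends on how often (grp1, grp2, grp3) combinations repeat in the data.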

Attachments

TAJO-601.patch, 18/Feb/14 12:01, 83 kB, Hyunsik Choi
TAJO-601_140220_142800.patch, 20/Feb/14 05:28, 83 kB, Hyunsik Choi

Issue Links

relates to: TAJO-1010 Improve multiple DISTINCT aggregation. (Resolved)

People

Assignee: Hyunsik Choi

Reporter: Hyunsik Choi

Votes: 0

Watchers: 2

Dates

Created: 14/Feb/14 12:08

Updated: 18/Aug/14 09:53

Resolved: 20/Feb/14 06:22