Goal

Provide a way to rewrite queries that combine COUNT(DISTINCT) with other aggregates such as SUM into a series of GROUP BY queries.
This is useful for pushing queries like the following down to Druid:

 SELECT COUNT(DISTINCT interval_marker), COUNT(DISTINCT dim), SUM(num_l) FROM druid_test_table GROUP BY `__time`, `zone`;

More generally, this is useful whenever a storage handler cannot evaluate COUNT(DISTINCT column) itself.

How to do it.

Use the Calcite rule

 org.apache.calcite.rel.rules.AggregateExpandDistinctAggregatesRule

that breaks down COUNT(DISTINCT) into either a single GROUP BY with grouping sets or, when multiple distinct counts are present, a series of GROUP BYs that may be linked with joins.
FYI, Hive already has a similar rule

 org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveExpandDistinctAggregatesRule

, but it only provides a rewrite to a grouping-sets-based plan.
I am planning to use the actual Calcite rule. ashutoshc, any concerns or caveats to be aware of?
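
To make the "series of Group by linked with Joins" shape concrete, here is a minimal hand-written sketch, not the plan Calcite actually emits. It uses SQLite (through Python's sqlite3) purely as a stand-in engine; the sample rows are made up, and only the table and column names come from the query above. SQLite's `IS` operator serves as the null-safe join condition where a real plan would use IS NOT DISTINCT FROM.

```python
import sqlite3

# Made-up sample rows; only the table and column names come from this issue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE druid_test_table (
    __time TEXT, zone TEXT, interval_marker TEXT, dim TEXT, num_l INTEGER);
INSERT INTO druid_test_table VALUES
    ('2018-05-17', 'us', 'a', 'x', 1),
    ('2018-05-17', 'us', 'a', 'y', 2),
    ('2018-05-17', 'us', 'b', 'y', 3),
    ('2018-05-17', 'eu', 'a', 'x', 4),
    ('2018-05-18', NULL, 'c', 'x', 5),
    ('2018-05-18', NULL, 'c', 'x', 6);
""")

# The query shape from the issue (group keys added to the select list).
ORIGINAL = """
SELECT __time, zone,
       COUNT(DISTINCT interval_marker), COUNT(DISTINCT dim), SUM(num_l)
FROM druid_test_table
GROUP BY __time, zone
"""

# Hand-written equivalent: one plain GROUP BY for SUM, plus one
# "group by keys + column, then count per keys" branch per distinct column.
# Branches are joined on the group keys with a null-safe comparison (IS).
REWRITTEN = """
SELECT s.__time, s.zone, m.cnt_marker, d.cnt_dim, s.sum_l
FROM (SELECT __time, zone, SUM(num_l) AS sum_l
      FROM druid_test_table GROUP BY __time, zone) s
JOIN (SELECT __time, zone, COUNT(interval_marker) AS cnt_marker
      FROM (SELECT __time, zone, interval_marker
            FROM druid_test_table GROUP BY __time, zone, interval_marker)
      GROUP BY __time, zone) m
  ON s.__time IS m.__time AND s.zone IS m.zone
JOIN (SELECT __time, zone, COUNT(dim) AS cnt_dim
      FROM (SELECT __time, zone, dim
            FROM druid_test_table GROUP BY __time, zone, dim)
      GROUP BY __time, zone) d
  ON s.__time IS d.__time AND s.zone IS d.zone
"""

def run(sql):
    """Return the result rows as a set so row order does not matter."""
    return set(conn.execute(sql).fetchall())

original_rows = run(ORIGINAL)
rewritten_rows = run(REWRITTEN)
print(original_rows == rewritten_rows)
```

Each inner query is a plain GROUP BY with no distinct aggregate, which is the kind of query a handler such as Druid can answer on its own; only the joins would have to run in Hive. The row with a NULL zone is there to show why the join needs null-safe equality.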

Concerns/questions

Need a way to switch between grouping sets and simple chained GROUP BYs based on plan cost. For instance, for a Druid-based scan it always makes sense (at least today) to push down a series of GROUP BYs and stitch the result sets together in Hive afterwards (as opposed to scanning everything).
This might not be true for other storage handlers: for one that can handle grouping sets, it is better to push down the grouping sets as a single table scan.
I am still unsure how I can lean on the cost optimizer to select the best plan. ashutoshc/jcamachorodriguez, any input?
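
For contrast with the chained GROUP BY shape, the sketch below illustrates roughly what a grouping-sets-based plan computes. It is an assumption-laden illustration, not the output of either rule: SQLite has no GROUPING SETS, so the three grouping sets are emulated with UNION ALL (which costs three scans here, whereas a handler with native grouping sets could do it in one), and the `gid` column is a hypothetical stand-in for the grouping id. The sample rows are made up.

```python
import sqlite3

# Made-up sample rows; only the table and column names come from this issue.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE druid_test_table (
    __time TEXT, zone TEXT, interval_marker TEXT, dim TEXT, num_l INTEGER);
INSERT INTO druid_test_table VALUES
    ('2018-05-17', 'us', 'a', 'x', 1),
    ('2018-05-17', 'us', 'b', 'y', 2),
    ('2018-05-17', 'us', 'b', 'x', 3),
    ('2018-05-18', NULL, 'c', 'x', 4);
""")

DIRECT = """
SELECT __time, zone,
       COUNT(DISTINCT interval_marker), COUNT(DISTINCT dim), SUM(num_l)
FROM druid_test_table
GROUP BY __time, zone
"""

# Emulates GROUPING SETS ((keys, interval_marker), (keys, dim), (keys)):
# each UNION ALL branch is one grouping set, tagged with a hypothetical gid.
# The outer aggregate then picks, per gid, the rows that feed each result.
GROUPING_SETS_STYLE = """
WITH expanded AS (
    SELECT __time, zone, interval_marker, NULL AS dim, NULL AS sum_l, 0 AS gid
    FROM druid_test_table GROUP BY __time, zone, interval_marker
    UNION ALL
    SELECT __time, zone, NULL, dim, NULL, 1
    FROM druid_test_table GROUP BY __time, zone, dim
    UNION ALL
    SELECT __time, zone, NULL, NULL, SUM(num_l), 2
    FROM druid_test_table GROUP BY __time, zone
)
SELECT __time, zone,
       COUNT(CASE WHEN gid = 0 THEN interval_marker END),
       COUNT(CASE WHEN gid = 1 THEN dim END),
       MIN(CASE WHEN gid = 2 THEN sum_l END)
FROM expanded
GROUP BY __time, zone
"""

direct_rows = set(db.execute(DIRECT).fetchall())
grouping_rows = set(db.execute(GROUPING_SETS_STYLE).fetchall())
print(direct_rows == grouping_rows)
```

Both shapes return the same rows, so the choice between them is purely a cost question: which parts a given storage handler can evaluate itself, and how many scans that takes.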

Attachments

HIVE-19586.patch	17/May/18 23:40	43 kB	Slim Bouguerra
HIVE-19586.6.patch	31/May/18 13:53	52 kB	Slim Bouguerra
HIVE-19586.5.patch	28/May/18 05:12	53 kB	Ashutosh Chauhan
HIVE-19586.4.patch	25/May/18 12:56	52 kB	Slim Bouguerra
HIVE-19586.3.patch	18/May/18 19:27	52 kB	Slim Bouguerra
HIVE-19586.3.patch	23/May/18 12:09	52 kB	Slim Bouguerra
HIVE-19586.2.patch	18/May/18 00:00	40 kB	Slim Bouguerra

Sub-Tasks

1.	Hive and Calcite have different semantics for Grouping sets	Open	Unassigned
2.	Unsupported Post join function 'IS NOT DISTINCT FROM'	Open	Unassigned
3.	Pushing Aggregates on Top of Aggregates	Open	Unassigned

People

Assignee: Slim Bouguerra

Reporter: Slim Bouguerra

Votes: 0

Watchers: 5

Dates

Created: 17/May/18 15:06

Updated: 27/Feb/24 22:23

Resolved: 03/Jun/18 16:16