The current approach has shown poor performance. You can see the current approach in the description of this issue.
This patch improves the performance of distinct aggregation. Unlike the current approach, in the this patch, GlobalPlanner builds three phase plan using two hash shuffles. Then, GlobalPlanner adds an enforcer of sort aggregation to the final execution block. As a result, it can reduce significantly intermediate data volume according to the cardinality of grouping columns.
This patch also allows Tajo to support multiple distinct functions. For example, the following query works well.
select l_orderkey, count(distinct l_partkey), sum(distinct l_partkey) from lineitem group by l_orderkey;
But, the current patch still has some limitations. The above query includes there are two count distinct functions: count(distinct), sum(distinct). They use the same distinct column 'l_partkey', so it works well. In contrast, the following case where there are two or more distinct columns is not supported yet.
select l_orderkey, count(distinct l_partkey), sum(distinct l_linenumber) from lineitem group by l_orderkey;
If you submit such a query, you will see the following messages: "different DISTINCT columns are not supported yet: l_partkey, l_linenumber". In order to support this kind of queries, we need additional physical executors. I'll add this feature later in another Jira issue.