Ashutosh Chauhan I think that for those two cases with hive.optimize.correlation=true, the ordering of key columns does not matter. Because in those queries, we only need to group rows, either [key, value] or [value, key] should be fine for the RS. The reason that I preserved the ordering in Correlation Optimizer is ReduceSinkDeDuplication can merge the RS for ORDER BY with another RS (for example, GROUP BY). In this case, ordering matters. When Correlation Optimizer gets the operator tree, it does not know if the key columns in a RS is only used for grouping or those columns are also used for ordering. I think it may be better to annotate what columns are used for grouping and what columns are used for sorting.
Chun Chen For your change, what will be the plan for the following query?
select c3, c2 from (select c1, c2, c3 from t1 order by c1, c2, c3) t group by c3, c2;
If we use [c1, c2, c3] as the key columns, rows with the same [c3, c2] are not grouped at the reduce side.
Based on my understanding, right now, the checkExprs in ReduceSinkDeDuplication only wants to handle cases that ckeys starts with pkeys, or pkeys starts with ckeys. For example, pkeys = [c1, c2, c3], and ckeys = [c1, c2].