Affects Version/s: None
Fix Version/s: 1.14.0
Grouping sets are currently implemented in Calcite using a separate indicator bit
for each of the grouping columns. For instance, consider the following group by clause:
GROUP BY CUBE (a, b)
The generated Aggregate operator in Calcite will have a row schema consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) is a boolean indicator field that represents whether x is participating in the group by clause.
In contrast, Hive's implementation stores a single number corresponding to the GROUPING bit vector associated with a row (this is the result of the GROUPING_ID function in RDBMSs such as Microsoft SQL Server and Oracle). Thus, the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].
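To make the two representations concrete, here is a minimal sketch (not code from Calcite or Hive) of how per-column indicator fields relate to a single packed number. The class and method names are hypothetical, and the bit layout is an assumption: the first grouping column is taken as the most significant bit, and an indicator value of true sets the bit. The actual polarity and bit order used by each engine may differ.

```java
// Hypothetical illustration only; names and bit layout are assumptions.
class GroupingIdSketch {

    /**
     * Packs per-column indicator flags into one number.
     * Assumes the first column maps to the most significant bit.
     */
    static long groupingId(boolean... indicators) {
        long id = 0L;
        for (boolean indicator : indicators) {
            // Shift the bits collected so far and append the next flag.
            id = (id << 1) | (indicator ? 1L : 0L);
        }
        return id;
    }

    /**
     * Recovers the flag of the column at position index (0-based, counted
     * from the left) out of columnCount grouping columns.
     */
    static boolean indicator(long groupingId, int index, int columnCount) {
        int shift = columnCount - 1 - index;
        return ((groupingId >> shift) & 1L) == 1L;
    }

    public static void main(String[] args) {
        // CUBE (a, b) yields four grouping sets, i.e. every combination
        // of the two indicator flags.
        boolean[] flags = {false, true};
        for (boolean a : flags) {
            for (boolean b : flags) {
                System.out.println(String.format(
                    "indicators=[%b, %b] -> packed id=%d", a, b, groupingId(a, b)));
            }
        }
    }
}
```

Under these assumptions the two schemas carry the same information: the indicator-style row exposes one field per grouping column, while the packed row exposes one number from which each flag can be recovered with a shift and a mask.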
This difference creates a mismatch between Calcite and Hive. As of now, we work around this mismatch on the Hive side: we create our own GROUPING_ID function applied over those columns. However, we have run into issues related to predicate pushdown, constant propagation, the join project transpose rule (HIVE-12923), etc., that we need to keep solving as new rules are added to the Hive optimizer. In short, this is making the code on the Hive side harder and harder to maintain.
This jira is intended to modify the implementation on the Calcite side so that we need not make workarounds/hacks in Hive to support Grouping IDs.