Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 4.0.0
Description
Currently, Spark GROUP BY only allows orderable data types; otherwise, plan analysis fails: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203
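For context, here is a paraphrased sketch of that check (not the exact source; see the link above — the wrapper object name `GroupingCheckSketch` and the plain exception are illustrative stand-ins):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, RowOrdering}

// Paraphrased sketch of the analysis-time check linked above: any grouping
// expression whose data type is not orderable is rejected. This is the
// restriction the proposal relaxes for calendar intervals.
object GroupingCheckSketch {
  def checkValidGroupingExprs(expr: Expression): Unit = {
    if (!RowOrdering.isOrderable(expr.dataType)) {
      // The real code reports an analysis error via failAnalysis with a
      // dedicated error class; a plain exception stands in for it here.
      throw new IllegalStateException(
        s"grouping expression ${expr.sql} has non-orderable type " +
          s"${expr.dataType.catalogString}")
    }
  }
}
```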
However, this is too strict, as GROUP BY only cares about equality, not ordering. The CalendarInterval type is not orderable (given 1 month and 30 days, we cannot tell which is larger), but it has well-defined equality. In fact, we already support `SELECT DISTINCT calendar_interval_type` in some cases (when the planner picks hash aggregate).
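As a hedged illustration of that inconsistency (assuming an active SparkSession named `spark`, that `make_interval(0, 1)` yields a calendar interval value on this build, and that the planner happens to pick hash aggregate for the DISTINCT query):

```scala
// Sketch only; see the assumptions stated above.

// May already work today, since hash aggregate needs only equality/hashing:
spark.sql("SELECT DISTINCT make_interval(0, 1) AS i FROM range(10)").show()

// Fails analysis today, tripping the orderability check sketched earlier:
spark.sql("SELECT make_interval(0, 1) AS i FROM range(10) GROUP BY i").show()
```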
The proposal here is to officially support the calendar interval type in GROUP BY: relax the check inside `CheckAnalysis`, make `CalendarInterval` implement `Comparable` using a natural ordering (compare months first, then days, then microseconds), and test with both hash aggregate and sort aggregate.
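A minimal sketch of that natural ordering, assuming the existing public fields of `CalendarInterval` (`months`, `days`, `microseconds`); the actual patch would implement `Comparable` on the Java class itself, and the `Ordering` object here is purely illustrative:

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Sketch of the proposed natural ordering: months first, then days, then the
// time part (stored as microseconds in CalendarInterval).
object CalendarIntervalOrdering extends Ordering[CalendarInterval] {
  override def compare(a: CalendarInterval, b: CalendarInterval): Int = {
    val byMonths = Integer.compare(a.months, b.months)
    if (byMonths != 0) byMonths
    else {
      val byDays = Integer.compare(a.days, b.days)
      if (byDays != 0) byDays
      else java.lang.Long.compare(a.microseconds, b.microseconds)
    }
  }
}
```

With such an ordering in place, sort aggregate can group calendar intervals deterministically, so its results can be tested against hash aggregate as the proposal suggests.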