SPARK-46536: Support GROUP BY calendar_interval_type


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      Currently, Spark's GROUP BY only allows orderable data types; otherwise, plan analysis fails: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203
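
      For example, before this change a query like the following is rejected at analysis time, because `make_interval` returns a CalendarIntervalType value (a minimal sketch, assuming a SparkSession named `spark`):

      // Grouping by a calendar interval: the analyzer rejects this today
      // because CalendarIntervalType is not an orderable data type.
      spark.sql(
        """SELECT make_interval(0, CAST(id % 2 AS INT)) AS iv, count(*) AS cnt
          |FROM range(10)
          |GROUP BY iv""".stripMargin
      ).show()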

      However, this is too strict, as GROUP BY only cares about equality, not ordering. The CalendarInterval type is not orderable (given 1 month and 30 days, we cannot tell which is larger), but it has well-defined equality. In fact, we already support `SELECT DISTINCT calendar_interval_type` in some cases (when hash aggregate is picked by the planner), as the example below shows.
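
      A distinct over the same values, by contrast, can already succeed when the planner picks hash aggregate, since only equality is needed (again a sketch, assuming a SparkSession named `spark`):

      // DISTINCT on a calendar interval only needs equality checks, so it can
      // work with hash aggregate even though the type is not orderable.
      spark.sql(
        """SELECT DISTINCT make_interval(0, CAST(id % 2 AS INT)) AS iv
          |FROM range(10)""".stripMargin
      ).show()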

      The proposal here is to officially support the calendar interval type in GROUP BY. We should relax the check inside `CheckAnalysis`, make `CalendarInterval` implement `Comparable` using a natural ordering (compare months first, then days, then microseconds), and test with both hash aggregate and sort aggregate.
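
      A rough sketch of the proposed natural ordering, using a hypothetical stand-in case class (the real org.apache.spark.unsafe.types.CalendarInterval is a Java class with months, days, and microseconds fields):

      // Stand-in for CalendarInterval; not the actual Spark class.
      case class Interval(months: Int, days: Int, microseconds: Long)
          extends Comparable[Interval] {
        // Natural ordering: months first, then days, then microseconds. This
        // is only a structural order; it makes no semantic claim about whether
        // 1 month is larger than 30 days, but it is stable and consistent
        // with field-by-field equality.
        override def compareTo(that: Interval): Int = {
          val m = Integer.compare(this.months, that.months)
          if (m != 0) return m
          val d = Integer.compare(this.days, that.days)
          if (d != 0) return d
          java.lang.Long.compare(this.microseconds, that.microseconds)
        }
      }

      // e.g. Interval(1, 0, 0).compareTo(Interval(0, 30, 0)) > 0: months compare first.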


People

    • Assignee: Unassigned
    • Reporter: Wenchen Fan (cloud_fan)
    • Votes: 0
    • Watchers: 2
