Spark / SPARK-27790

Support ANSI SQL INTERVAL types


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 3.2.0, 3.3.0
    • Component/s: SQL

    Description

      Spark has an INTERVAL data type, but it is “broken”:

      1. It cannot be persisted
      2. It is not comparable because it crosses the month-day line. That is, there is no telling whether "1 Month" is equal to "30 Days", since not all months have the same number of days.
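The comparability problem can be illustrated in plain Python (illustrative only, not Spark code): how many days "1 month" covers depends on which month you start in, so a month count has no fixed ordering against a day count.

```python
import calendar

# monthrange(year, month) returns (weekday of day 1, number of days in month).
jan_days = calendar.monthrange(2021, 1)[1]  # days in January 2021
feb_days = calendar.monthrange(2021, 2)[1]  # days in February 2021

# 31 vs 28: whether "1 Month" equals, exceeds, or falls short of "30 Days"
# depends on context, so a mixed month/day interval is not totally ordered.
print(jan_days, feb_days)
```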

      I propose to introduce the two flavors of INTERVAL described in the ANSI SQL standard and to deprecate Spark's existing interval type.

      • ANSI describes two non-overlapping "classes":
        • YEAR-MONTH,
        • DAY-SECOND ranges
      • Members within each class can be compared and sorted.
      • Both support datetime arithmetic.
      • Both can be persisted.
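A minimal Python sketch (not Spark's implementation) of why each ANSI class is totally ordered: a year-month interval reduces to a single month count, and a day-second interval to a fixed number of days/seconds/microseconds, so values within one class compare directly.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True, order=True)
class YearMonthInterval:
    # Stored as a total number of months, so ordering is well defined.
    months: int

# 1-2 (14 months) compares cleanly against 1-1 (13 months).
assert YearMonthInterval(14) > YearMonthInterval(13)

# A day-second interval is essentially what Python's timedelta models:
# a fixed duration, also totally ordered.
assert timedelta(days=1, seconds=1) > timedelta(days=1)
```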

      The old and new flavors of INTERVAL can coexist until Spark's INTERVAL is eventually retired. Any semantic "breakage" can be controlled via legacy config settings.

      Milestone 1 – Spark interval equivalency (the new interval types meet or exceed all functionality of the existing interval type):

      • Add two new DataType implementations for interval year-month and day-second, including the JSON format and DDL string.
      • Infra support: check the caller sides of DateType/TimestampType
      • Support the two new interval types in Dataset/UDF.
      • Interval literals (with a legacy config to still allow mixed year-month and day-second fields and return legacy interval values)
      • Interval arithmetic (interval * num, interval / num, interval +/- interval)
      • Datetime functions/operators: Datetime - Datetime (to days or day second), Datetime +/- interval
      • Cast to and from the two new interval types, cast string to interval, cast interval to string (pretty printing), with SQL syntax to specify the types
      • Support sorting intervals.
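For the day-second class, the arithmetic listed above (interval * num, interval / num, interval +/- interval, Datetime +/- interval, Datetime - Datetime) matches the behavior of Python's timedelta. A hedged sketch of the intended semantics, not of Spark's internals:

```python
from datetime import datetime, timedelta

iv = timedelta(days=1, hours=2)                        # a day-second interval

assert iv * 2 == timedelta(days=2, hours=4)            # interval * num
assert iv / 2 == timedelta(hours=13)                   # interval / num
assert iv + iv == timedelta(days=2, hours=4)           # interval + interval

ts = datetime(2021, 1, 1)
assert ts + iv == datetime(2021, 1, 2, 2)              # Datetime + interval
assert datetime(2021, 1, 2) - ts == timedelta(days=1)  # Datetime - Datetime
```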

      Milestone 2 – Persistence:

      • Ability to create tables of type interval
      • Ability to write to common file formats such as Parquet and JSON.
      • INSERT, SELECT, UPDATE, MERGE
      • Discovery

      Milestone 3 – Client support

      • JDBC support
      • Hive Thrift server

      Milestone 4 – PySpark and Spark R integration

      • Python UDF can take and return intervals
      • DataFrame support
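On the Python side, such a UDF body is a plain function over datetime.timedelta, which is the natural Python counterpart of a day-time interval. A hedged sketch; the actual PySpark registration (wrapping the function with pyspark.sql.functions.udf and an interval return type) is left as an assumption and not shown here:

```python
from datetime import timedelta

def double_interval(iv: timedelta) -> timedelta:
    # Candidate Python UDF body: takes a day-time interval, returns one.
    # In PySpark this function would be registered as a UDF so it can be
    # applied to an interval column (registration details are assumptions).
    return iv * 2

assert double_interval(timedelta(hours=12)) == timedelta(days=1)
```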


            People

              Assignee: Max Gekk (maxgekk)
              Reporter: Max Gekk (maxgekk)
              Votes: 1
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: