Spark has an INTERVAL data type, but it is “broken”:
- It cannot be persisted
- It is not comparable because it crosses the month/day line: not all months have the same number of days, so there is no telling whether, say, “1 Month” is longer than “30 Days” (see the sketch below).
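To make the comparability problem concrete, here is a minimal illustration using only the Python standard library: the length of “one month” depends on which month it is, so a value that mixes month and day fields has no consistent ordering.

```python
from datetime import date

# "One month" starting in January spans 31 days; starting in February, 28.
jan = (date(2021, 2, 1) - date(2021, 1, 1)).days  # 31
feb = (date(2021, 3, 1) - date(2021, 2, 1)).days  # 28
print(jan, feb)

# So "is 1 month longer than 30 days?" has no single answer, and a
# mixed month/day interval value cannot be given a total order.
```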
I propose here to introduce the two flavors of INTERVAL described in the ANSI SQL Standard and to deprecate Spark’s interval type.
- ANSI describes two non-overlapping “classes”: YEAR-MONTH and DAY-SECOND (literal forms are sketched after this list)
- Members within each class can be compared and sorted.
- Both classes support datetime arithmetic
- Both classes can be persisted.
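For reference, the two classes look like this as SQL literals. This is a sketch using the literal syntax from the ANSI standard; the exact Spark syntax would be settled during implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# YEAR-MONTH class: carries only year and month fields ('1-2' = 1 year 2 months).
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH AS ym").show()

# DAY-SECOND class: carries day through second fields ('3 04:05:06').
spark.sql("SELECT INTERVAL '3 04:05:06' DAY TO SECOND AS ds").show()
```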
The old and new flavors of INTERVAL can coexist until the Spark INTERVAL is eventually retired, and any semantic “breakage” can be controlled via legacy config settings (a sketch follows).
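A sketch of how such a legacy switch might be used; the config name here is illustrative, not final.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative config name: when enabled, interval literals would keep the
# old mixed-field Spark semantics instead of resolving to an ANSI class.
spark.conf.set("spark.sql.legacy.interval.enabled", "true")
```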
Milestone 1 – Spark Interval equivalency (the new interval types meet or exceed all functionality of the existing Spark SQL INTERVAL; a combined sketch follows this list):
- Add two new DataType implementations for interval year-month and day-second, including the JSON format and DDL string.
- Infra support: check the call sites of DateType/TimestampType
- Support the two new interval types in Dataset/UDF.
- Interval literals (with a legacy config to still allow mixed year-month and day-second fields and return legacy interval values)
- Interval arithmetic (interval * num, interval / num, interval +/- interval)
- Datetime functions/operators: Datetime - Datetime (yielding days or a day-second interval), Datetime +/- interval
- Cast to and from the two new interval types, cast string to interval, cast interval to string (pretty printing), with the SQL syntax to specify the types
- Support sorting intervals.
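A combined sketch of the milestone 1 surface area, written against the proposed behavior rather than a shipped API: literals, interval and datetime arithmetic, casts, and sorting.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Interval arithmetic within one class.
spark.sql("""
    SELECT INTERVAL '1-0' YEAR TO MONTH * 2         AS doubled,
           INTERVAL '10 00:00:00' DAY TO SECOND / 2 AS halved
""").show()

# Datetime arithmetic: Datetime +/- interval, Datetime - Datetime.
spark.sql("""
    SELECT TIMESTAMP '2021-01-31 00:00:00' + INTERVAL '1' MONTH AS plus_month,
           TIMESTAMP '2021-01-02 00:00:00'
             - TIMESTAMP '2021-01-01 00:00:00'                  AS elapsed
""").show()

# Cast between strings and the new types, and sort within a class.
spark.sql("SELECT CAST('1-2' AS INTERVAL YEAR TO MONTH) AS ym").show()
spark.sql("""
    SELECT i
    FROM VALUES (INTERVAL '1-1' YEAR TO MONTH),
                (INTERVAL '0-2' YEAR TO MONTH) AS t(i)
    ORDER BY i
""").show()
```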
Milestone 2 – Persistence:
- Ability to create tables with interval-typed columns
- Ability to write to common file formats such as Parquet and JSON (sketched below).
- INSERT, SELECT, UPDATE, MERGE
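A sketch of what milestone 2 would enable, assuming the Parquet writer accepts the new types:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A day-second interval column, persisted to Parquet and read back.
df = spark.sql("""
    SELECT TIMESTAMP '2021-01-02 03:00:00'
             - TIMESTAMP '2021-01-01 00:00:00' AS elapsed
""")
df.write.mode("overwrite").parquet("/tmp/interval_demo")
spark.read.parquet("/tmp/interval_demo").printSchema()
```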
Milestone 3 – Client support:
- JDBC support
- Hive Thrift server (a client sketch follows)
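A client-side sketch against the Thrift server; PyHive is used here only as an example client, and the connection details are illustrative. Clients without a native interval type would typically receive the value as a string.

```python
from pyhive import hive

# Illustrative connection details for a local Spark Thrift server.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT INTERVAL '1-2' YEAR TO MONTH")
print(cursor.fetchall())
cursor.close()
conn.close()
```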
Milestone 4 – PySpark and SparkR integration:
- Python UDFs can take and return intervals (sketched after this list)
- DataFrame support
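A sketch of the PySpark side, assuming the day-second class maps to Python's datetime.timedelta (the exact mapping is part of what this milestone would define):

```python
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DayTimeIntervalType

spark = SparkSession.builder.getOrCreate()

# A Python UDF that takes and returns a day-second interval, assuming the
# proposed timedelta mapping.
@udf(returnType=DayTimeIntervalType())
def double_interval(d: timedelta) -> timedelta:
    return d * 2

df = spark.sql("""
    SELECT TIMESTAMP '2021-01-02 00:00:00'
             - TIMESTAMP '2021-01-01 00:00:00' AS d
""")
df.select(double_interval("d").alias("doubled")).show()
```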