SPARK-37544: sequence over dates with month interval is producing incorrect results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.3.0, 3.1.4, 3.2.2
    • Component/s: SQL
    • Environment: Ubuntu 20, OSX 11.6; OpenJDK 11, Spark 3.2

    Description

      The sequence function produces unexpected results when the bounds are dates and the step is an interval in months.

      Here is a sample using Spark 3.2 (though the behavior is the same in 3.1.1 and presumably earlier):

      scala> spark.sql("select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x, date '2021-01-01' + interval '3' month y").collect()
      res1: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01, 2021-03-31, 2021-06-30, 2021-09-30, 2022-01-01),2021-04-01])

      The expected result of adding 3 months to 2021-01-01 is 2021-04-01 (which is what column y returns), while sequence returns 2021-03-31 as its second element.
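
      For reference, the expected values can be reproduced outside Spark with plain calendar arithmetic. The sketch below is only an illustration, not Spark code: monthSequence is a hypothetical helper built on java.time.LocalDate.plusMonths, and it assumes each element is computed as start + i * step. It can be pasted into a Scala REPL or spark-shell.

```scala
import java.time.LocalDate

// Hypothetical helper (not a Spark API): builds the inclusive sequence
// start, start + step, start + 2 * step, ... up to stop, in calendar months.
def monthSequence(start: LocalDate, stop: LocalDate, stepMonths: Int): Vector[LocalDate] = {
  require(stepMonths > 0, "this sketch only handles positive steps")
  // Each element is derived from `start`, so end-of-month clamping cannot accumulate.
  Iterator.from(0)
    .map(i => start.plusMonths(i.toLong * stepMonths))
    .takeWhile(d => !d.isAfter(stop))
    .toVector
}

val expected = monthSequence(LocalDate.parse("2021-01-01"), LocalDate.parse("2022-01-01"), 3)

// Values copied from res1 in this report.
val reported = Vector("2021-01-01", "2021-03-31", "2021-06-30", "2021-09-30", "2022-01-01")
  .map(s => LocalDate.parse(s))

// Pair up the elements and keep the positions where the two disagree.
val mismatches = expected.zip(reported).filter { case (e, r) => e != r }
mismatches.foreach { case (e, r) => println(s"expected $e but the report shows $r") }
```

      With the values from res1, the three middle elements differ from the calendar-month result; the first and last elements agree.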

      At the same time, sequence over timestamps works as expected:

      scala> spark.sql("select sequence(timestamp '2021-01-01 00:00', timestamp '2022-01-01 00:00', interval '3' month) x").collect()
      res2: Array[org.apache.spark.sql.Row] = Array([WrappedArray(2021-01-01 00:00:00.0, 2021-04-01 00:00:00.0, 2021-07-01 00:00:00.0, 2021-10-01 00:00:00.0, 2022-01-01 00:00:00.0)])
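
      This report does not identify a cause. Purely as a hypothesis, one mechanism that would yield the date values in res1 is a time-zone mismatch: the date is read as midnight UTC while the months are added on the wall clock of a session time zone west of UTC. The sketch below models that with java.time only. The zone America/Los_Angeles and the helper shiftedAdd are assumptions made for illustration (the report does not mention a session time zone), and none of this is taken from Spark's source.

```scala
import java.time.{LocalDate, ZoneId, ZoneOffset}

// Assumption: the report does not state a session time zone; this one is hypothetical.
val sessionZone = ZoneId.of("America/Los_Angeles")

// Hypothetical model, not Spark code: read the date as midnight UTC, add the
// months on the session-zone wall clock, then read the result back as a UTC date.
def shiftedAdd(date: LocalDate, months: Long): LocalDate =
  date.atStartOfDay(ZoneOffset.UTC)       // midnight UTC on the given date
    .withZoneSameInstant(sessionZone)     // same instant on the session-zone wall clock
    .plusMonths(months)                   // calendar arithmetic in the session zone
    .withZoneSameInstant(ZoneOffset.UTC)  // back to UTC
    .toLocalDate

val start = LocalDate.parse("2021-01-01")
val modelled = (0 to 4).map(i => shiftedAdd(start, 3L * i)).toVector
modelled.foreach(println)
```

      Under these assumptions the model prints 2021-01-01, 2021-03-31, 2021-06-30, 2021-09-30 and 2022-01-01, the same values as res1. That makes the hypothesis consistent with the report, but it does not confirm it.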

      A similar issue was reported in the past: SPARK-31654, "sequence producing inconsistent intervals for month step".
      It is marked resolved, but the problem has either resurfaced or was never actually fixed.

          People

            Assignee: Bruce Robbins (bersprockets)
            Reporter: Vsevolod Ostapenko (seva_ostapenko)