SPARK-42635: Several counter-intuitive behaviours in the TimestampAdd expression


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0, 3.3.1, 3.3.2
    • Fix Version/s: 3.3.3, 3.4.1
    • Component/s: SQL
    • Labels: None

    Description

      1. When the time is close to a daylight saving time transition, the result may be discontinuous and non-monotonic.

      We currently have:

      scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
      scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
      +------------------------------------------------------------------------+
      |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
      +------------------------------------------------------------------------+
      |                                                     2011-03-13 03:59:59|
      +------------------------------------------------------------------------+
      scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
      +------------------------------------------------------------------+
      |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
      +------------------------------------------------------------------+
      |                                               2011-03-13 03:00:00|
      +------------------------------------------------------------------+ 


      Between the two queries, adding one more second moves the result back by almost an hour, from 03:59:59 to 03:00:00. Moreover, only 23 * 3600 seconds actually elapse from 2011-03-12 03:00:00 to 2011-03-13 03:00:00, not 24 * 3600 seconds, because of the daylight saving time transition.

      The root cause of the problem is that the Spark code at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790 wrongly assumes every day has MICROS_PER_DAY microseconds, and performs the day and time-in-day split before taking the time zone into account.
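
      The failure mode can be reproduced with plain java.time (a minimal sketch, not Spark's actual code; seconds are used instead of microseconds for simplicity): splitting the offset into whole days plus a remainder turns the day part into calendar days, which are not always 24 hours long, whereas arithmetic on Instant stays on the absolute timeline.

      import java.time.{LocalDateTime, ZoneId}

      val zone  = ZoneId.of("America/Los_Angeles")
      val start = LocalDateTime.parse("2011-03-12T03:00:00").atZone(zone).toInstant

      // Instant arithmetic is DST-safe: adding 24 * 3600 elapsed seconds lands on
      // 04:00, because one wall-clock hour is skipped at the spring-forward transition.
      println(start.plusSeconds(24 * 3600).atZone(zone))
      // 2011-03-13T04:00-07:00[America/Los_Angeles]

      // The day/time-in-day split adds the whole days as calendar days, so only
      // 23 * 3600 seconds actually elapse, reproducing the 03:00:00 result above.
      val offsetSeconds = 24 * 3600
      val days = offsetSeconds / 86400   // assumes every day has 86400 seconds
      val rest = offsetSeconds % 86400
      println(start.atZone(zone).plusDays(days).plusSeconds(rest))
      // 2011-03-13T03:00-07:00[America/Los_Angeles]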

      2. Adding month, quarter, and year units silently ignores Int overflow during the unit conversion.

      The root cause is https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246: quantity is multiplied by 3 or MONTHS_PER_YEAR without an overflow check. Note that overflow is checked when the resulting amount is added to the timestamp, so the behavior is inconsistent.

      This can cause counter-intuitive results like this:

      scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
      +------------------------------------------------------------------+
      |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
      +------------------------------------------------------------------+
      |                                               1969-09-01 00:00:00|
      +------------------------------------------------------------------+
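
      The result follows from 32-bit wraparound: 1431655764 * 3 = 4294967292 = 2^32 - 4, so the Int multiplication wraps to -4 and the expression effectively subtracts four months from 1970-01-01, giving 1969-09-01. A minimal sketch of the wraparound, and of the checked multiplication that would surface it instead:

      val quantity = 1431655764
      // The Int multiply wraps: 1431655764 * 3 = 4294967292 = 2^32 - 4, i.e. -4,
      // so -4 months is silently added instead of 4294967292 months.
      println(quantity * 3)                    // -4
      // A checked multiply raises an error instead of wrapping:
      println(Math.multiplyExact(quantity, 3)) // throws java.lang.ArithmeticException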

      3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond) silently ignores Long overflow during unit conversion.

      This is similar to the previous problem:

      scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
      +-------------------------------------------------------------+
      |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
      +-------------------------------------------------------------+
      |-290308-12-22 15:58:10.448384                                |
      +-------------------------------------------------------------+
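
      Here the unit conversion itself overflows a Long: 106751992 * MICROS_PER_DAY = 106751992 * 86400000000 = 9223372108800000000, slightly above Long.MaxValue (9223372036854775807), so the multiplication wraps to a large negative microsecond offset and the timestamp lands in year -290308. A minimal sketch:

      val microsPerDay = 86400000000L
      // 106751992 days in microseconds just exceeds Long.MaxValue, so the
      // multiply wraps to a large negative offset, explaining the -290308 year.
      println(106751992L * microsPerDay)                    // -9223371964909551616
      // A checked multiply would raise an error instead of wrapping:
      println(Math.multiplyExact(106751992L, microsPerDay)) // throws java.lang.ArithmeticException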


People

    • Assignee: Chenhao Li (mashplant)
    • Reporter: Chenhao Li (mashplant)
    • Votes: 0
    • Watchers: 3
