  1. Spark
  2. SPARK-15834 Time zone / locale sensitivity umbrella
  3. SPARK-11415

Catalyst DateType Shifts Input Data by Local Timezone

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I've been running type tests for the Spark Cassandra Connector and couldn't get a consistent result for java.sql.Date. Investigating, I found that the following code is used to create Catalyst DateType values:

      https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144

        /**
         * Returns the number of days since epoch from java.sql.Date.
         */
        def fromJavaDate(date: Date): SQLDate = {
          millisToDays(date.getTime)
        }
      

      But millisToDays does not abide by this contract: it shifts the underlying timestamp into the local timezone before calculating the days since epoch, so the resulting day count can differ from the UTC date of the input.

        // we should use the exact day as Int, for example, (year, month, day) -> day
        def millisToDays(millisUtc: Long): SQLDate = {
          // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
          // will correctly work as input for function toJavaDate(Int)
          val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
          Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
        }
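
      To make the shift concrete, here is a minimal sketch of the same arithmetic. It assumes the zone is passed as a parameter rather than read from threadLocalLocalTimeZone, and it uses a fixed-offset zone ID so the result does not depend on daylight-saving rules. The object name, the utcDays helper, and the sample day number are illustrative and not part of Spark.

```scala
import java.util.TimeZone

object MillisToDaysSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Same arithmetic as the quoted millisToDays, but the zone is an explicit
  // parameter (an assumption of this sketch) instead of a thread-local value.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  // Hypothetical helper: day count with no zone shift at all.
  def utcDays(millisUtc: Long): Int =
    Math.floorDiv(millisUtc, MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    val utcMidnight = 16738L * MillisPerDay // 2015-10-30T00:00:00Z
    val minusEight = TimeZone.getTimeZone("GMT-08:00")
    println(utcDays(utcMidnight))                  // 16738
    println(millisToDays(utcMidnight, minusEight)) // 16737, one day earlier
  }
}
```

      With a zone west of UTC, an instant at UTC midnight lands on the previous local day, which is the shift described above.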
      

      The inverse function also incorrectly shifts by the timezone offset:

        // reverse of millisToDays
        def daysToMillis(days: SQLDate): Long = {
          val millisUtc = days.toLong * MILLIS_PER_DAY
          millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
        }
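
      The following sketch illustrates what the inverse does for a single zone. As before, the zone is an explicit parameter (an assumption made for the example), the zone ID is a fixed offset, and the object name and sample day number are made up for illustration.

```scala
import java.util.TimeZone

object DaysToMillisSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Mirrors the quoted millisToDays, with the zone passed in explicitly.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  // Mirrors the quoted daysToMillis, with the zone passed in explicitly.
  def daysToMillis(days: Int, tz: TimeZone): Long = {
    val millisUtc = days.toLong * MillisPerDay
    millisUtc - tz.getOffset(millisUtc)
  }

  def main(args: Array[String]): Unit = {
    val plusNine = TimeZone.getTimeZone("GMT+09:00")
    val millis = daysToMillis(16738, plusNine)
    // The result is local midnight, nine hours before UTC midnight.
    println(millis - 16738L * MillisPerDay) // -32400000
    // Converting back in the same zone recovers the original day.
    println(millisToDays(millis, plusNine)) // 16738
  }
}
```

      Within one zone the two functions are consistent with each other. The returned millisecond value, however, is not a multiple of MILLIS_PER_DAY, so it is not the UTC midnight of that day.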
      
      

      https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93

      This will cause off-by-one errors and could cause significant shifts in data if the underlying data is processed in timezones other than UTC.
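
      A hedged illustration of the cross-timezone case: a day is turned into milliseconds under one zone and turned back into a day under another. The zone parameters, the fixed-offset zone IDs, and the object name are assumptions made for this sketch; the two conversion functions repeat the arithmetic quoted above.

```scala
import java.util.TimeZone

object CrossZoneSketch {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Same arithmetic as the quoted functions, with an explicit zone parameter.
  def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
    val millisLocal = millisUtc + tz.getOffset(millisUtc)
    Math.floor(millisLocal.toDouble / MillisPerDay).toInt
  }

  def daysToMillis(days: Int, tz: TimeZone): Long = {
    val millisUtc = days.toLong * MillisPerDay
    millisUtc - tz.getOffset(millisUtc)
  }

  def main(args: Array[String]): Unit = {
    val east = TimeZone.getTimeZone("GMT+09:00")
    val west = TimeZone.getTimeZone("GMT-08:00")
    val day = 16738 // 2015-10-30 as days since epoch

    // Written east of UTC, read west of UTC: the day moves back by one.
    println(millisToDays(daysToMillis(day, east), west)) // 16737
    // Written west of UTC, read east of UTC: the day is unchanged.
    println(millisToDays(daysToMillis(day, west), east)) // 16738
  }
}
```

      Whether a shift appears depends on the direction of the offset difference, which would explain why results can look inconsistent between environments.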

          People

            Assignee: Unassigned
            Reporter: Russell Spitzer (rspitzer)