SPARK-10177

Parquet support interprets timestamp values differently from Hive 0.14.0+


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None
    • Sprint: Spark 1.5 doc/QA sprint

    Description

      Running the following SQL under Hive 0.14.0+ (tested against 0.14.0 and 1.2.1):

      CREATE TABLE ts_test STORED AS PARQUET
      AS SELECT CAST("2015-01-01 00:00:00" AS TIMESTAMP);
      

      Then read the Parquet file generated by Hive with Spark SQL:

      scala> sqlContext.read.parquet("hdfs://localhost:9000/user/hive/warehouse_hive14/ts_test").collect()
      res1: Array[org.apache.spark.sql.Row] = Array([2015-01-01 12:00:00.0])
      

      This issue can easily be reproduced with the test case in PR #8392.

      Spark 1.4.1 works as expected in this case.
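For context (not stated in the issue text above): Hive writes Parquet timestamps as 12-byte INT96 values laid out as 8 little-endian bytes of nanoseconds-since-midnight followed by a 4-byte Julian Day Number. A minimal Python sketch of that decoding (the function name and example bytes are illustrative, not taken from any Parquet library):

```python
import struct

def decode_int96_timestamp(raw: bytes):
    # Parquet INT96 timestamps, as written by Hive, pack 8 little-endian
    # bytes of nanoseconds since midnight followed by a 4-byte Julian Day
    # Number identifying the calendar date.
    time_of_day_nanos, julian_day = struct.unpack("<qi", raw)
    return julian_day, time_of_day_nanos

# 2015-01-01 has Julian Day Number 2457024; midnight means zero nanos.
raw = struct.pack("<qi", 0, 2457024)
print(decode_int96_timestamp(raw))  # (2457024, 0)
```

Reading such a value back correctly therefore hinges on converting the (Julian day, nanos-of-day) pair to an epoch-based timestamp, which is where the bug below comes in.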


      Update:

      It seems the problem is that the Julian day conversion in DateTimeUtils is wrong. The following spark-shell session illustrates it:

      import java.sql._
      import java.util._
      import org.apache.hadoop.hive.ql.io.parquet.timestamp._
      import org.apache.spark.sql.catalyst.util._
      
      TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
      val ts = Timestamp.valueOf("1970-01-01 00:00:00")
      val nt = NanoTimeUtils.getNanoTime(ts, false)
      val jts = DateTimeUtils.fromJulianDay(nt.getJulianDay, nt.getTimeOfDayNanos)
      DateTimeUtils.toJavaTimestamp(jts)
      
      // ==> java.sql.Timestamp = 1970-01-01 12:00:00.0 (should be 1970-01-01 00:00:00.0)
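The 12-hour shift is consistent with mixing up two Julian-day conventions: astronomical Julian days begin at noon, while Hive's NanoTime uses the Julian Day Number of the calendar date plus nanoseconds counted from midnight. A minimal Python sketch of the arithmetic (function names and the buggy variant are illustrative, not Spark's actual code):

```python
# Julian Day Number of 1970-01-01. Hive's NanoTime counts
# timeOfDayNanos from *midnight* of the date this number labels.
JULIAN_DAY_OF_EPOCH = 2440588
SECONDS_PER_DAY = 86400

def from_julian_day(julian_day, time_of_day_nanos):
    """Correct conversion to epoch seconds: no half-day adjustment."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY \
        + time_of_day_nanos // 1_000_000_000

def from_julian_day_buggy(julian_day, time_of_day_nanos):
    """Treating the day field as a noon-based Julian day adds 12 hours."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY \
        + SECONDS_PER_DAY // 2 + time_of_day_nanos // 1_000_000_000

print(from_julian_day(2440588, 0))        # 0     -> 1970-01-01 00:00:00 UTC
print(from_julian_day_buggy(2440588, 0))  # 43200 -> 1970-01-01 12:00:00 UTC
```

The buggy variant reproduces exactly the shift seen in the spark-shell session above.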
      

      Attachments

        1. 000000_0
          0.2 kB
          Cheng Lian


          People

            Assignee: Davies Liu (davies)
            Reporter: Cheng Lian (lian cheng)
            Votes: 0
            Watchers: 5
