SPARK-10177

Parquet support interprets timestamp values differently from Hive 0.14.0+


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None
    • Sprint: Spark 1.5 doc/QA sprint

    Description

      Running the following SQL under Hive 0.14.0+ (tested against 0.14.0 and 1.2.1):

      CREATE TABLE ts_test STORED AS PARQUET
      AS SELECT CAST("2015-01-01 00:00:00" AS TIMESTAMP);
      

      Then read the Parquet file generated by Hive with Spark SQL:

      scala> sqlContext.read.parquet("hdfs://localhost:9000/user/hive/warehouse_hive14/ts_test").collect()
      res1: Array[org.apache.spark.sql.Row] = Array([2015-01-01 12:00:00.0])
      

      This issue can easily be reproduced with the test case in PR #8392.

      Spark 1.4.1 works as expected in this case.
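For context (not stated in the issue text above): Hive writes Parquet timestamps as 12-byte INT96 values laid out as 8 little-endian bytes of nanoseconds-since-midnight followed by a 4-byte Julian Day Number. A minimal Python sketch of that decoding (the function name and example bytes are illustrative, not taken from any Parquet library):

```python
import struct

def decode_int96_timestamp(raw: bytes):
    # Parquet INT96 timestamps, as written by Hive, pack 8 little-endian
    # bytes of nanoseconds since midnight followed by a 4-byte Julian Day
    # Number identifying the calendar date.
    time_of_day_nanos, julian_day = struct.unpack("<qi", raw)
    return julian_day, time_of_day_nanos

# 2015-01-01 has Julian Day Number 2457024; midnight means zero nanos.
raw = struct.pack("<qi", 0, 2457024)
print(decode_int96_timestamp(raw))  # (2457024, 0)
```

Reading such a value back correctly therefore hinges on converting the (Julian day, nanos-of-day) pair to an epoch-based timestamp, which is where the bug below comes in.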


      Update:

      It seems the problem is that the Julian day conversion in DateTimeUtils is wrong. The following spark-shell session illustrates it:

      import java.sql._
      import java.util._
      import org.apache.hadoop.hive.ql.io.parquet.timestamp._
      import org.apache.spark.sql.catalyst.util._
      
      TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
      val ts = Timestamp.valueOf("1970-01-01 00:00:00")
      val nt = NanoTimeUtils.getNanoTime(ts, false)
      val jts = DateTimeUtils.fromJulianDay(nt.getJulianDay, nt.getTimeOfDayNanos)
      DateTimeUtils.toJavaTimestamp(jts)
      
      // ==> java.sql.Timestamp = 1970-01-01 12:00:00.0 (should be 1970-01-01 00:00:00.0)
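The 12-hour shift is consistent with mixing up two Julian-day conventions: astronomical Julian days begin at noon, while Hive's NanoTime uses the Julian Day Number of the calendar date plus nanoseconds counted from midnight. A minimal Python sketch of the arithmetic (function names and the buggy variant are illustrative, not Spark's actual code):

```python
# Julian Day Number of 1970-01-01. Hive's NanoTime counts
# timeOfDayNanos from *midnight* of the date this number labels.
JULIAN_DAY_OF_EPOCH = 2440588
SECONDS_PER_DAY = 86400

def from_julian_day(julian_day, time_of_day_nanos):
    """Correct conversion to epoch seconds: no half-day adjustment."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY \
        + time_of_day_nanos // 1_000_000_000

def from_julian_day_buggy(julian_day, time_of_day_nanos):
    """Treating the day field as a noon-based Julian day adds 12 hours."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY \
        + SECONDS_PER_DAY // 2 + time_of_day_nanos // 1_000_000_000

print(from_julian_day(2440588, 0))        # 0     -> 1970-01-01 00:00:00 UTC
print(from_julian_day_buggy(2440588, 0))  # 43200 -> 1970-01-01 12:00:00 UTC
```

The buggy variant reproduces exactly the shift seen in the spark-shell session above.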
      

      Attachments

        1. 000000_0
          0.2 kB
          Cheng Lian


          People

            Assignee: Davies Liu (davies)
            Reporter: Cheng Lian (lian cheng)
            Votes: 0
            Watchers: 5
