While testing Spark SQL 1.5-SNAPSHOT for Parquet/Hive compatibility, we hit
SPARK-10177. In short, Spark SQL and Hive each have their own Julian date conversion code, and their results don't match. For now, we've fixed this issue by making Spark SQL behave the same way as Hive so that the two can interoperate (see Spark PR #8400). However, Hive's behavior looks a little bit weird to me: when converting a calendar timestamp to a Julian timestamp, Hive always gives a result 12 hours later than the expected one.
This behavior can be verified with the following snippet in the Spark 1.5-SNAPSHOT shell (I'm using that shell since it ships with the Hive 1.2.1 dependencies):
import java.sql.Timestamp
import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils

val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")
val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |  Julian day:        $hiveJulianDay
     |  Time of day nanos: $hiveTimeOfDayNanos
   """.stripMargin)
The result is:
Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
  Julian day:        2440588
  Time of day nanos: 0
According to the definition on this page, Julian dates count from noon. Namely, "00:00:00" of any calendar date must map to a Julian timestamp whose fractional part is 0.5, i.e. an integral day plus 12 hours. The correct Julian timestamp given by the converter on the aforementioned page is "2440587.500000", which is equivalent to a Julian day of 2440587 plus a time-of-day of 43200000000000 nanoseconds (12 hours), rather than the (2440588, 0) pair Hive produces.
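The "2440587.500000" figure can be double-checked without any Hive dependencies. The sketch below is mine (the object and value names are made up, not from any library); it only relies on the standard definition that Julian day 0 began at noon UTC, which places Julian day zero at -210866760000000 in epoch milliseconds:

```scala
// Sanity check: the Julian date of the Unix epoch (1970-01-01 00:00:00 UTC).
// Hypothetical helper, not part of Hive or Spark.
object JulianEpochCheck {
  val MillisPerDay = 86400000L
  // Epoch millis of Julian day 0 (noon UTC), derived from the standard
  // definition: 2440587.5 days before the Unix epoch.
  val JulianDayZeroEpochMillis = -210866760000000L

  def julianDate(epochMillis: Long): Double =
    (epochMillis - JulianDayZeroEpochMillis).toDouble / MillisPerDay

  def main(args: Array[String]): Unit =
    println(f"${julianDate(0L)}%.6f") // prints "2440587.500000"
}
```

Running it reproduces exactly the converter's answer, confirming that midnight of the epoch sits halfway through Julian day 2440587.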
This means that the INT96 timestamp values stored in Parquet files written by Hive all carry a 12-hour offset. (I haven't tried to verify this issue against Impala.)
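To make the offset concrete, here is a sketch (my own code, mirroring the julian-day-relative-to-2440588 convention rather than Hive's actual reader implementation) that decodes an INT96-style (julianDay, timeOfDayNanos) pair both ways:

```scala
// Two ways to read back the (julianDay, timeOfDayNanos) pair Hive writes
// for 1970-01-01 00:00:00. Hypothetical sketch, not Hive's actual code.
object Int96Offset {
  val MillisPerDay  = 86400000L
  val NanosPerMilli = 1000000L

  // Strict astronomical reading: JD 2440587.5 is 1970-01-01 00:00:00 UTC.
  def strictEpochMillis(julianDay: Int, timeOfDayNanos: Long): Long = {
    val jdMillis = julianDay.toLong * MillisPerDay + timeOfDayNanos / NanosPerMilli
    jdMillis - (2440587L * MillisPerDay + MillisPerDay / 2)
  }

  // Hive-style reading: julian day 2440588 with zero nanos is the epoch.
  def hiveEpochMillis(julianDay: Int, timeOfDayNanos: Long): Long =
    (julianDay - 2440588L) * MillisPerDay + timeOfDayNanos / NanosPerMilli

  def main(args: Array[String]): Unit = {
    val (day, nanos) = (2440588, 0L) // what Hive writes for the epoch
    println(strictEpochMillis(day, nanos)) // prints "43200000" (noon, 12 h late)
    println(hiveEpochMillis(day, nanos))   // prints "0" (midnight, round-trips)
  }
}
```

Read strictly, Hive's pair lands 43200000 ms (12 hours) past midnight; read with the same convention Hive wrote it in, it round-trips to the original instant.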
This shouldn't be a big problem though, as long as the read path always decodes the written timestamp values with the same convention. Just out of curiosity, is this 12-hour offset intentional?