Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 3.1.2
Fix Version: None
Component: None
Description
Describe the bug
We are trying to store the timestamp "2022" in a table created via a Spark DataFrame. The table is created with the Avro file format. We encounter no errors while creating the table and inserting the timestamp value. However, performing a SELECT query on the table through the Hive CLI returns an incorrect value: "+53971-10-02 19:00:0000".
The root cause of this issue is that Spark's AvroSerializer serializes timestamps using Avro's TIMESTAMP_MICROS logical type, while Hive's AvroDeserializer assumes timestamps use Avro's TIMESTAMP_MILLIS during deserialization.
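To illustrate the unit mismatch, here is a minimal sketch in plain Scala (no Spark or Hive required; the variable names are ours). It shows how an epoch-microseconds value that is decoded as if it were milliseconds lands tens of thousands of years in the future:

```scala
import java.time.Instant

// Epoch-microseconds for 2022-01-01T00:00:00Z, which is roughly what Spark's
// AvroSerializer writes for this row (TIMESTAMP_MICROS logical type)
val micros: Long = Instant.parse("2022-01-01T00:00:00Z").toEpochMilli * 1000L

// Correct decoding: treat the long as microseconds (seconds + nanos)
val asMicros = Instant.ofEpochSecond(micros / 1000000L, (micros % 1000000L) * 1000L)

// Hive's AvroDeserializer instead treats the same long as milliseconds,
// inflating the instant by a factor of 1000
val asMillis = Instant.ofEpochMilli(micros)

println(asMicros) // 2022-01-01T00:00:00Z
println(asMillis) // a date in year ~53971, matching the garbage value above
```

The exact day and time Hive prints also depends on its timezone and calendar handling, but the factor-of-1000 inflation alone accounts for the ~52,000-year shift.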
Steps to reproduce
On Spark 3.2.1 (commit `4f25b3f712`), using `spark-shell` with the Avro package:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1
Execute the following:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Seq(Row(Seq("2022").toDF("time").select(to_timestamp(col("time")).as("to_timestamp")).first().getAs[java.sql.Timestamp](0))))
val schema = new StructType().add(StructField("c1", TimestampType, true))
val df = spark.createDataFrame(rdd, schema)
df.show(false)
df.write.mode("overwrite").format("avro").saveAsTable("ws")
On Hive 3.1.2, execute the following:
hive> select * from ws;
OK
+53971-10-02 19:00:0000
Expected behavior
We expect the output of the SELECT query to be "2022-01-01 00:00:00". We tried other formats, such as Parquet, and the outcome is consistent with this expectation. Moreover, the timestamp is interpreted correctly when the table is written via a DataFrame and read via spark-shell/spark-sql:
Can be read correctly from spark-shell:
scala> spark.sql("select * from ws;").show(false)
+-------------------+
|c1                 |
+-------------------+
|2022-01-01 00:00:00|
+-------------------+
Can be read correctly from spark-sql:
spark-sql> select * from ws;
2022-01-01 00:00:00
Time taken: 0.063 seconds, Fetched 1 row(s)