Description
Write dates/timestamps to a Parquet file in Spark 2.4:
$ export TZ="UTC"
$ ~/spark-2.4/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]

scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
Spark 2.4 saves dates/timestamps in the Julian calendar. Reading the same file, the parquet-mr tool prints 1001-01-07 and 1001-01-07T01:02:03.123456+0000 instead of the values written:
$ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
INT32 d
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07

INT64 ts
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 1 ***
value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) prints the same values as parquet-mr, which differ from the values Spark 2.4 wrote and reads back:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-07|1001-01-07 01:02:03.123456|
+----------+--------------------------+
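The six-day shift between 1001-01-01 and 1001-01-07 can be reproduced without Spark or Parquet. The snippet below is only a sketch of the calendar arithmetic, not Spark's actual rebasing code: `RebaseSketch` and `hybridToProleptic` are hypothetical names introduced here, and it assumes that `java.util.GregorianCalendar` (Julian before the 1582 cutover) stands in for the Spark 2.4 writer while `java.time.LocalDate` (proleptic Gregorian) stands in for the Spark 3.0 / parquet-mr reader.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper for illustration; not part of Spark.
object RebaseSketch {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Interpret (year, month, day) in the hybrid Julian + Gregorian calendar,
  // convert it to days since 1970-01-01, then render that day count in the
  // proleptic Gregorian calendar.
  def hybridToProleptic(year: Int, month: Int, day: Int): LocalDate = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                    // drop the current time, keep the zone
    cal.set(year, month - 1, day)  // Calendar months are 0-based
    val epochDays = Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
    LocalDate.ofEpochDay(epochDays)
  }

  def main(args: Array[String]): Unit = {
    println(hybridToProleptic(1001, 1, 1))  // prints 1001-01-07
    println(hybridToProleptic(2020, 1, 1))  // prints 2020-01-01
  }
}
```

Dates after the 1582 cutover map to themselves, so under these assumptions only old dates such as the one in this report would show a difference.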
Issue Links
- is related to
  - SPARK-31296 Benchmark date-time rebasing in Parquet datasource (Resolved)
  - SPARK-31318 Split Parquet/Avro configs for rebasing dates/timestamps in read and in write (Resolved)
- links to