Description
Write timestamps to Parquet files with dictionary encoding enabled:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/Users/maximgekk/tmp/parquet-ts-dict")

// Exiting paste mode, now interpreting.
Read them back:
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-ts-dict").show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
|1001-01-07 00:32:20.123|
+-----------------------+
The expected value in every row is 1001-01-01 01:02:03.123.
I verified that the timestamp column is dictionary-encoded via:
➜  parquet-ts-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-00000-2c6c89b1-d165-4528-9a9d-796baa07908e-c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
ts:  INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:8 ENC:BIT_PACKED,RLE,PLA [more]...

    ts TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8

INT64 ts
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 2: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 3: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 4: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 5: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 6: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 7: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
value 8: R:0 D:1 V:1001-01-06T22:02:03.123000+0000
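The roughly 6-day gap between the written value (1001-01-01) and the values seen in the file and on read (1001-01-06/07) is consistent with the write-side rebase having been applied and never reversed on read. The following is a minimal sketch of that calendar shift only, not Spark's actual rebase code: the object `CalendarShift` and the helper `hybridToProleptic` are hypothetical names introduced for illustration, everything is done in UTC, and the time-of-day difference (01:02:03 vs. 00:32:20), which presumably comes from time zone offsets, is not modeled.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper for illustration; not part of Spark.
object CalendarShift {
  // Interprets (year, month, day) in the hybrid Julian/Gregorian calendar
  // (java.util.GregorianCalendar, default cutover 1582-10-15) and returns
  // the same instant as a proleptic Gregorian date (java.time), in UTC.
  def hybridToProleptic(year: Int, month: Int, day: Int): LocalDate = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                   // reset all fields, including time of day
    cal.set(year, month - 1, day) // Calendar months are 0-based
    Instant.ofEpochMilli(cal.getTimeInMillis).atZone(ZoneOffset.UTC).toLocalDate
  }

  def main(args: Array[String]): Unit = {
    // A date before the cutover: the two calendars disagree.
    println(hybridToProleptic(1001, 1, 1))
    // A modern date: the two calendars agree.
    println(hybridToProleptic(2020, 1, 1))
  }
}
```

Under these assumptions, 1001-01-01 in the hybrid calendar maps to 1001-01-07 in the proleptic Gregorian calendar, which matches the date read back above, while modern dates are unaffected.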
Issue Links
- is a clone of SPARK-31662 Reading wrong dates from dictionary encoded columns in Parquet files (Resolved)