Description
Write dates with dictionary encoding enabled to parquet files:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date"))
  .repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .mode("overwrite")
  .parquet("/Users/maximgekk/tmp/parquet-date-dict")

// Exiting paste mode, now interpreting.
Read them back:
scala> spark.read.parquet("/Users/maximgekk/tmp/parquet-date-dict").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
|1001-01-07|
+----------+
The expected value in every row is 1001-01-01, the date that was written.
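The 6-day gap between the written date (1001-01-01) and the date read back (1001-01-07) matches the distance between the Julian and proleptic Gregorian calendars in that year. The sketch below is only an illustration of that arithmetic using the JDK, not Spark's rebasing code; the object and method names (RebaseOffsetSketch, gregorianEpochDay, hybridEpochDay) are hypothetical. It assumes that the value stored in the file is the day number of the hybrid Julian + Gregorian calendar and that the reader interprets it as a proleptic Gregorian day number without rebasing it back, which the report does not state explicitly.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper, for illustration only: compares day numbers of the
// same year-month-day label in two calendars.
object RebaseOffsetSketch {
  private val MillisPerDay = 24L * 60 * 60 * 1000

  // Days since 1970-01-01 in the proleptic Gregorian calendar (java.time).
  def gregorianEpochDay(year: Int, month: Int, day: Int): Long =
    LocalDate.of(year, month, day).toEpochDay

  // Days since 1970-01-01 for the same label in the hybrid Julian + Gregorian
  // calendar (java.util.GregorianCalendar with its default 1582 cutover).
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                   // reset all fields, including time of day
    cal.set(year, month - 1, day) // Calendar months are 0-based
    java.lang.Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
  }

  def main(args: Array[String]): Unit = {
    val gregorian = gregorianEpochDay(1001, 1, 1)
    val hybrid = hybridEpochDay(1001, 1, 1)
    println(s"offset in days: ${hybrid - gregorian}")
    // Interpreting the hybrid day number as proleptic Gregorian without
    // rebasing it back reproduces the date shown in the report.
    println(s"hybrid day number read as proleptic Gregorian: ${LocalDate.ofEpochDay(hybrid)}")
  }
}
```

If the assumption holds, the printed offset is 6 and the reinterpreted date is 1001-01-07, which would mean the dates were rebased on write but the dictionary-encoded values were not rebased on read.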
I verified with parquet-tools that the date column is dictionary-encoded:
➜  parquet-date-dict java -jar ~/Downloads/parquet-tools-1.12.0.jar dump ./part-00000-84a77214-0c8c-45e9-ac41-5ca863b9dd94-c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
date:  INT32 SNAPPY DO:0 FPO:4 SZ:74/70/0.95 VC:8 ENC:BIT_PACKED,RLE,P [more]...

    date TV=8 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... VC:8

INT32 date
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 8 ***
value 1: R:0 D:1 V:1001-01-07
value 2: R:0 D:1 V:1001-01-07
value 3: R:0 D:1 V:1001-01-07
value 4: R:0 D:1 V:1001-01-07
value 5: R:0 D:1 V:1001-01-07
value 6: R:0 D:1 V:1001-01-07
value 7: R:0 D:1 V:1001-01-07
value 8: R:0 D:1 V:1001-01-07
Issue Links
- is cloned by SPARK-31672 Reading wrong timestamps from dictionary encoded columns in Parquet files (Resolved)