Details
Description
We're seeing incorrect date filters on Parquet files written by Spark 2 or by Spark 3 with legacy rebase mode.
This is the expected behavior that we see in corrected mode (Spark 3.1.2):
Good (Corrected Mode)
>>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED") >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected") >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show() +----------+-------------------+ | date|(date = 0001-01-01)| +----------+-------------------+ |0001-01-01| true| +----------+-------------------+ >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show() +----------+ | date| +----------+ |0001-01-01| +----------+
This is how we get incorrect results in legacy mode, in this case the filter is dropping rows it shouldn't:
Bad (Legacy Mode)
In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy") >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show() +----------+-------------------+ | date|(date = 0001-01-01)| +----------+-------------------+ |0001-01-01| true| +----------+-------------------+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show() +----+ |date| +----+ +----+ >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain() == Physical Plan == *(1) Filter (isnotnull(date#154) AND (date#154 = -719162)) +- *(1) ColumnarToRow +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct<date:date>