SPARK-36034

Incorrect datetime filter when reading Parquet files written in legacy mode


Details

    Description

      We're seeing date filters return incorrect results on Parquet files written by Spark 2, or by Spark 3 with the legacy rebase mode (spark.sql.legacy.parquet.datetimeRebaseModeInWrite set to LEGACY).
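
      For context, writes and reads have separate rebase configs; the sessions below only set the write-side one. The read-side config name here is our assumption based on the Spark 3.1 configuration docs:

      Rebase configs
      >>> # Write-side mode, toggled in the repros below
      >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
      >>> # Read-side counterpart (name assumed from Spark 3.1 docs)
      >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")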

      This is the expected behavior that we see in corrected mode (Spark 3.1.2):

      Good (Corrected Mode)
      >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
      
      >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
      
      >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", "date = '0001-01-01'").show()
      +----------+-------------------+
      |      date|(date = 0001-01-01)|
      +----------+-------------------+
      |0001-01-01|               true|
      +----------+-------------------+
      
      >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = '0001-01-01'").show()
      +----------+
      |      date|
      +----------+
      |0001-01-01|
      +----------+
      

      This is how we get incorrect results in legacy mode; in this case, the filter drops rows it shouldn't:

      Bad (Legacy Mode)
      >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
      
      >>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
      
      >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", "date = '0001-01-01'").show()
      +----------+-------------------+
      |      date|(date = 0001-01-01)|
      +----------+-------------------+
      |0001-01-01|               true|
      +----------+-------------------+
      
      >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
      +----+
      |date|
      +----+
      +----+
      
      >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").explain()
      == Physical Plan ==
      *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
      +- *(1) ColumnarToRow
         +- FileScan parquet [date#154] Batched: true, DataFilters: [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar..., PartitionFilters: [], PushedFilters: [IsNotNull(date), EqualTo(date,0001-01-01)], ReadSchema: struct<date:date>
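
      Our working theory (an assumption based on the plan above, not verified against the reader code): the pushed-down filter converts DATE '0001-01-01' to epoch days in the proleptic Gregorian calendar, while a legacy-mode file stores the hybrid Julian/Gregorian day count, which is two days earlier around year 1. The Parquet-level filter therefore never matches the stored value, and the row is skipped before the reader can rebase it. The -719162 in the plan is the proleptic Gregorian day number:

      Day-number arithmetic (plain Python, proleptic Gregorian)
      >>> from datetime import date
      >>> # Epoch days pushed down for EqualTo(date, 0001-01-01)
      >>> (date(1, 1, 1) - date(1970, 1, 1)).days
      -719162
      >>> # Assumed on-disk value under legacy rebase: the hybrid Julian count,
      >>> # i.e. -719164 (two days earlier; not read back from the file)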
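
      A possible workaround (an untested sketch, assuming only the pushed Parquet filters compare unrebased values while the post-scan Filter sees rebased ones) is to disable Parquet filter pushdown for the affected read:

      Possible workaround (untested)
      >>> spark.conf.set("spark.sql.parquet.filterPushdown", "false")
      >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()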
      


          People

            Assignee: Max Gekk (maxgekk)
            Reporter: Willi Raschkowski (rshkv)
            Votes: 0
            Watchers: 3
