  1. Spark
  2. SPARK-39833

Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.1, 3.3.0
    • Fix Version/s: 3.3.1, 3.2.3, 3.4.0
    • Component/s: SQL

    Description

      One of our data scientists discovered a problem wherein a data frame `.show()` call printed non-empty results, but `.count()` printed 0. I've narrowed the issue to a small, reproducible test case which exhibits this aberrant behavior. In pyspark, run the following code:

      from pyspark.sql.types import IntegerType, StructField, StructType
      # The data frame column name is upper case: COL0
      parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], schema=StructType(fields=[StructField("COL0",IntegerType(),True)]))
      # The partition directory in the output path is lower case: col0=0
      parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
      # Re-read from the top-level directory, so col0 also comes from the path
      reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
      reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
      print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
      

      In my usage, this prints a data frame with 1 row and a count of 0. However, disabling `spark.sql.parquet.filterPushdown` produces consistent results:

      spark.conf.set("spark.sql.parquet.filterPushdown", False)
      reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
      print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
      

      This prints the same data frame, but now reports a count of 1. Enabling `spark.sql.parquet.filterPushdown` (which is the default) is not enough on its own to trigger the bug: the case of the column name in the data frame (before writing) must also differ from the case of the partition column name in the file path, e.g. COL0 in the data frame versus col0 in the path, or the reverse.
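
To make the trigger condition concrete, here is a minimal sketch in plain Python that flags partition column names in a path that match a data column name case-insensitively but not exactly. The helper name `find_case_mismatched_partitions` is hypothetical and not part of Spark. It only inspects `key=value` path segments and a list of column names supplied by the caller, so it illustrates the condition described above rather than how Spark itself resolves columns.

```python
def find_case_mismatched_partitions(path, data_columns):
    """Return (partition_column, data_column) pairs that differ only by case.

    Hypothetical helper for illustration; it does not call Spark.
    """
    # Collect partition column names from Hive-style "key=value" path segments.
    partition_cols = []
    for segment in path.split("/"):
        if "=" in segment:
            key, _, _value = segment.partition("=")
            if key:
                partition_cols.append(key)

    # A pair is a mismatch when the names are equal ignoring case but not equal exactly.
    mismatches = []
    for part in partition_cols:
        for col in data_columns:
            if part.lower() == col.lower() and part != col:
                mismatches.append((part, col))
    return mismatches


mismatches = find_case_mismatched_partitions(
    "parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet", ["COL0"]
)
print(mismatches)  # [('col0', 'COL0')]
```

A check like this could be run before writing into a partitioned directory layout, on the assumption that the caller knows the column names of the data frame being written.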

            People

              Assignee: Ivan Sadikov (ivan.sadikov)
              Reporter: Michael Allman (msa)