Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.2.1, 3.3.0
Description
One of our data scientists discovered a problem wherein a data frame `.show()` call printed non-empty results, but `.count()` printed 0. I've narrowed the issue down to a small, reproducible test case that exhibits this aberrant behavior. In PySpark, run the following code:
```python
from pyspark.sql.types import *

# Note the upper-case data column (COL0) versus the lower-case partition
# directory (col0=0) in the output path.
parquet_pushdown_bug_df = spark.createDataFrame(
    [{"COL0": int(0)}],
    schema=StructType(fields=[StructField("COL0", IntegerType(), True)]))
parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet(
    "parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")

reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
```
In my usage, this prints a data frame with 1 row and a count of 0. However, disabling `spark.sql.parquet.filterPushdown` produces consistent results:
```python
spark.conf.set("spark.sql.parquet.filterPushdown", False)
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
```
This prints the same data frame, but now also reports a count of 1. The key to triggering this bug is not just having `spark.sql.parquet.filterPushdown` enabled (it is enabled by default): the case of the column in the data frame (before writing) must also differ from the case of the partition column in the file path, i.e. COL0 versus col0 or col0 versus COL0.
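For comparison, below is a minimal sketch (my addition, not taken from the original report) of the same steps with the data column and the partition directory in matching case; per the observation above, this variant should not trigger the inconsistency, so `show()` and `count()` should agree. The path `parquet_pushdown_ok` and the variable names are hypothetical, chosen for illustration.

```python
from pyspark.sql.types import StructType, StructField, IntegerType

# Matching case: data column col0 and partition directory col0=0.
matching_case_df = spark.createDataFrame(
    [{"col0": 0}],
    schema=StructType(fields=[StructField("col0", IntegerType(), True)]))
matching_case_df.repartition(1).write.mode("overwrite").parquet(
    "parquet_pushdown_ok/col0=0/parquet_pushdown_ok.parquet")

reread_ok_df = spark.read.parquet("parquet_pushdown_ok")
reread_ok_df.filter("col0 = 0").show()          # expected: one row shown
print(reread_ok_df.filter("col0 = 0").count())  # expected: 1
```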
Issue Links
- is related to SPARK-40169 Fix the issue with Parquet column index and predicate pushdown in Data source V1 (Resolved)
- relates to PARQUET-2170 Empty projection returns the wrong number of rows when column index is enabled (Open)