Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.2.1, 3.3.0
Description
One of our data scientists discovered a problem wherein a data frame `.show()` call printed non-empty results, but `.count()` printed 0. I've narrowed the issue down to a small, reproducible test case that exhibits this aberrant behavior. In PySpark, run the following code:
```python
from pyspark.sql.types import *

# Note the upper-case data column (COL0) versus the lower-case partition
# directory (col0=0) in the output path.
parquet_pushdown_bug_df = spark.createDataFrame(
    [{"COL0": int(0)}],
    schema=StructType(fields=[StructField("COL0", IntegerType(), True)]))
parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet(
    "parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")

reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
```
In my usage, this prints a data frame with 1 row and a count of 0. However, disabling `spark.sql.parquet.filterPushdown` produces consistent results:
```python
spark.conf.set("spark.sql.parquet.filterPushdown", False)
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
```
This prints the same data frame, but now also reports a count of 1. The key to triggering this bug is not just having `spark.sql.parquet.filterPushdown` enabled (it is enabled by default): the case of the column in the data frame (before writing) must also differ from the case of the partition column in the file path, i.e. COL0 versus col0 or col0 versus COL0.
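For comparison, below is a minimal sketch (my addition, not taken from the original report) of the same steps with the data column and the partition directory in matching case; per the observation above, this variant should not trigger the inconsistency, so `show()` and `count()` should agree. The path `parquet_pushdown_ok` and the variable names are hypothetical, chosen for illustration.

```python
from pyspark.sql.types import StructType, StructField, IntegerType

# Matching case: data column col0 and partition directory col0=0.
matching_case_df = spark.createDataFrame(
    [{"col0": 0}],
    schema=StructType(fields=[StructField("col0", IntegerType(), True)]))
matching_case_df.repartition(1).write.mode("overwrite").parquet(
    "parquet_pushdown_ok/col0=0/parquet_pushdown_ok.parquet")

reread_ok_df = spark.read.parquet("parquet_pushdown_ok")
reread_ok_df.filter("col0 = 0").show()          # expected: one row shown
print(reread_ok_df.filter("col0 = 0").count())  # expected: 1
```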
Issue Links
- is related to SPARK-40169 Fix the issue with Parquet column index and predicate pushdown in Data source V1 (Resolved)
- relates to PARQUET-2170 Empty projection returns the wrong number of rows when column index is enabled (Open)