Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Discovered in Spark: when reading an empty projection from a Parquet file with filter pushdown enabled (typically a filter + count query), parquet-mr returns a wrong number of rows when the column index feature is enabled. When the column index feature is disabled, the result is correct.
This happens due to the following:
- ParquetFileReader::getFilteredRowCount() (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851) selects row ranges to calculate the row count when the column index is enabled.
- In ColumnIndexFilter (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80) we filter the row ranges and pass in the set of projected paths, which in this case is empty.
- When evaluating the filter, if the column path is not in that set, we return an empty list of rows (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178), which is always the case for an empty projection.
- This results in the incorrect number of records reported by the library.
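The failure mode above can be sketched in a few lines of self-contained Java. This is an illustration of the described logic, not the actual parquet-mr API: filterRowRanges, allRows, and rowCount are hypothetical stand-ins for the row-range handling in ColumnIndexFilter and ParquetFileReader. The point is that a filter column missing from the projected-paths set yields an empty row range, so with an empty projection the counted rows collapse to zero.

```java
import java.util.*;

public class EmptyProjectionSketch {
    // Hypothetical stand-in for "all rows in a row group" as a single
    // inclusive [start, end] index range.
    static List<long[]> allRows(long rowCount) {
        return List.of(new long[] { 0, rowCount - 1 });
    }

    // Simplified model of the described behavior: a filter column that is
    // not in the projected-paths set yields an EMPTY row range, instead of
    // falling back to all rows.
    static List<long[]> filterRowRanges(String filterColumn,
                                        Set<String> projectedPaths,
                                        long rowCount) {
        if (!projectedPaths.contains(filterColumn)) {
            return Collections.emptyList(); // the problematic branch
        }
        // The real code would consult the column index here; for the
        // sketch, assume every row matches the predicate.
        return allRows(rowCount);
    }

    // Sum the sizes of the selected ranges, i.e. the reported row count.
    static long rowCount(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) n += r[1] - r[0] + 1;
        return n;
    }

    public static void main(String[] args) {
        long totalRows = 100;
        // Empty projection, e.g. a pushed-down filter followed by count():
        long counted = rowCount(filterRowRanges("x", Set.of(), totalRows));
        System.out.println(counted); // prints 0 instead of the true match count
    }
}
```

With a non-empty projection containing the filter column, the same sketch reports all 100 rows, matching the observation that disabling the column index (or projecting the column) gives the correct result.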
I will provide the full repro later.
Issue Links
- is related to SPARK-39833: Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true (Resolved)