Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Discovered in Spark: when reading an empty projection from a Parquet file with filter pushdown enabled (typically a filter + count query), parquet-mr returns a wrong number of rows when the column index feature is enabled. When the column index feature is disabled, the result is correct.
This happens due to the following:
- ParquetFileReader::getFilteredRowCount() (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851) selects row ranges to calculate the row count when the column index is enabled.
- In ColumnIndexFilter (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80) we filter the row ranges and pass in the set of projected paths, which in this case is empty.
- When evaluating the filter, if the column path is not in that set, we return an empty list of rows (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178), which is always the case for an empty projection.
- This results in the incorrect number of records reported by the library.
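The failure mode above can be sketched in a few lines of self-contained Java. This is an illustration of the described logic, not the actual parquet-mr API: filterRowRanges, allRows, and rowCount are hypothetical stand-ins for the row-range handling in ColumnIndexFilter and ParquetFileReader. The point is that a filter column missing from the projected-paths set yields an empty row range, so with an empty projection the counted rows collapse to zero.

```java
import java.util.*;

public class EmptyProjectionSketch {
    // Hypothetical stand-in for "all rows in a row group" as a single
    // inclusive [start, end] index range.
    static List<long[]> allRows(long rowCount) {
        return List.of(new long[] { 0, rowCount - 1 });
    }

    // Simplified model of the described behavior: a filter column that is
    // not in the projected-paths set yields an EMPTY row range, instead of
    // falling back to all rows.
    static List<long[]> filterRowRanges(String filterColumn,
                                        Set<String> projectedPaths,
                                        long rowCount) {
        if (!projectedPaths.contains(filterColumn)) {
            return Collections.emptyList(); // the problematic branch
        }
        // The real code would consult the column index here; for the
        // sketch, assume every row matches the predicate.
        return allRows(rowCount);
    }

    // Sum the sizes of the selected ranges, i.e. the reported row count.
    static long rowCount(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) n += r[1] - r[0] + 1;
        return n;
    }

    public static void main(String[] args) {
        long totalRows = 100;
        // Empty projection, e.g. a pushed-down filter followed by count():
        long counted = rowCount(filterRowRanges("x", Set.of(), totalRows));
        System.out.println(counted); // prints 0 instead of the true match count
    }
}
```

With a non-empty projection containing the filter column, the same sketch reports all 100 rows, matching the observation that disabling the column index (or projecting the column) gives the correct result.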
I will provide the full repro later.
Issue Links
- is related to SPARK-39833: Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true (Resolved)