[PARQUET-1927] ColumnIndex should provide number of records skipped - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.11.0
Fix Version/s: None
Component/s: parquet-mr
Labels: None

Description

While integrating Parquet ColumnIndex, I found that we need Parquet to report how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969

Iceberg reads Parquet records with an iterator. Its hasNext() contains the following check:

valuesRead + skippedValues < totalValues

See (https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).

So without knowing the number of skipped values, it is hard to determine whether hasNext() should return true or false.
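
To make the problem concrete, here is a minimal sketch of the two forms of the check. The class and method names are hypothetical and are not part of parquet-mr or Iceberg; only the expression `valuesRead + skippedValues < totalValues` comes from the code quoted above.

```java
/** Hypothetical illustration; not the parquet-mr or Iceberg API. */
final class SkippedRowsSketch {
    private SkippedRowsSketch() {}

    /** The check as it degrades when the reader cannot report skipped rows. */
    static boolean hasNextIgnoringSkips(long valuesRead, long totalValues) {
        return valuesRead < totalValues;
    }

    /** The check quoted above, which needs the skipped count from the reader. */
    static boolean hasNext(long valuesRead, long skippedValues, long totalValues) {
        return valuesRead + skippedValues < totalValues;
    }
}
```

Suppose a file holds 300 rows and filtering drops 100 of them. After the remaining 200 rows are consumed, `hasNextIgnoringSkips(200, 300)` still returns true even though nothing is left to read, whereas `hasNext(200, 100, 300)` returns false as intended.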

Currently, we can work around this by using a flag. When readNextFilteredRowGroup() returns null, we consider the whole file done, and hasNext() then just returns false.
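
A rough sketch of the flag-based workaround follows. It is not the actual Iceberg change: `FlagBasedIterator` is a hypothetical class, and the `Supplier` stands in for readNextFilteredRowGroup(), assumed here to return the rows of the next non-empty row group, or null once the file is exhausted.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Hypothetical iterator; nextRowGroup stands in for readNextFilteredRowGroup(). */
final class FlagBasedIterator<T> implements Iterator<T> {
    private final Supplier<List<T>> nextRowGroup; // returns null when the file is done
    private Iterator<T> current = Collections.emptyIterator();
    private boolean exhausted = false; // the workaround flag

    FlagBasedIterator(Supplier<List<T>> nextRowGroup) {
        this.nextRowGroup = nextRowGroup;
    }

    @Override
    public boolean hasNext() {
        // Pull row groups until one yields rows or the reader signals the end.
        while (!current.hasNext() && !exhausted) {
            List<T> rows = nextRowGroup.get();
            if (rows == null) {
                exhausted = true; // no row counting needed
            } else {
                current = rows.iterator();
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

The flag sidesteps row counting entirely, so it works regardless of how many rows were filtered out, but the caller still cannot report how many records were skipped.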

People

Assignee: Xinli Shang

Reporter: Xinli Shang

Votes: 0

Watchers: 4

Dates

Created: 17/Oct/20 14:42

Updated: 23/Jun/24 03:32