Parquet / PARQUET-1927

ColumnIndex should provide number of records skipped


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.11.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      When integrating Parquet ColumnIndex, I found that we need Parquet to tell us how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969

       

      Iceberg reads Parquet records through an iterator. Its hasNext() contains the following check:

      valuesRead + skippedValues < totalValues

      See (https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115). 

      So without knowing the number of skipped values, it is hard to determine whether hasNext() should return true or false.
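To make the accounting problem concrete, here is a minimal sketch in plain Java. It does not use parquet-mr or Iceberg: RowGroupStub, SilentSkipReader, and SkipAccountingDemo are hypothetical stand-ins, and the row counts are invented for illustration. The reader mimics the behavior described above by silently advancing past row groups whose filtered row count is 0, so the caller can only count skips inside the row groups it actually receives.

```java
import java.util.List;

// Hypothetical stand-in for a row group: total rows vs. rows that survive ColumnIndex filtering.
final class RowGroupStub {
  final long totalRows;
  final long matchingRows;

  RowGroupStub(long totalRows, long matchingRows) {
    this.totalRows = totalRows;
    this.matchingRows = matchingRows;
  }
}

// Hypothetical reader that mimics the described behavior:
// row groups with zero matching rows are skipped without informing the caller.
final class SilentSkipReader {
  private final List<RowGroupStub> groups;
  private int next = 0;

  SilentSkipReader(List<RowGroupStub> groups) {
    this.groups = groups;
  }

  // Total record count of the file, as a caller would learn it from file metadata.
  long totalRows() {
    long sum = 0;
    for (RowGroupStub g : groups) {
      sum += g.totalRows;
    }
    return sum;
  }

  // Returns the next row group with at least one matching row, or null at end of file.
  RowGroupStub readNextFiltered() {
    while (next < groups.size()) {
      RowGroupStub g = groups.get(next++);
      if (g.matchingRows > 0) {
        return g;
      }
      // Fully filtered row group: dropped silently, its rows are never reported.
    }
    return null;
  }
}

final class SkipAccountingDemo {
  // Drains the reader and returns {valuesRead, skippedValues} as the caller can observe them.
  static long[] drain(SilentSkipReader reader) {
    long valuesRead = 0;
    long skippedValues = 0;
    RowGroupStub g;
    while ((g = reader.readNextFiltered()) != null) {
      valuesRead += g.matchingRows;
      skippedValues += g.totalRows - g.matchingRows;
    }
    return new long[] {valuesRead, skippedValues};
  }
}
```

With row groups of (100 total, 40 matching), (50 total, 0 matching), and (30 total, 30 matching), the caller observes valuesRead = 70 and skippedValues = 60 against totalValues = 180. The 50 rows of the silently dropped row group are never counted, so a check of the form valuesRead + skippedValues < totalValues stays true even though nothing is left to read.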

       

      Currently, we can work around this by using a flag. When readNextFilteredRowGroup() returns null, we consider the whole file done, and hasNext() just returns false from then on.
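The flag-based workaround could look roughly like the sketch below. It is an illustration, not Iceberg's actual code: FlagBasedIterator is a hypothetical name, and a Supplier that returns a list of records (or null at end of file) stands in for readNextFilteredRowGroup().

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical iterator that relies on an "exhausted" flag instead of row-count arithmetic.
final class FlagBasedIterator<T> implements Iterator<T> {
  // Stand-in for readNextFilteredRowGroup(): yields the next row group's records, or null when the file is done.
  private final Supplier<List<T>> nextRowGroup;
  private Iterator<T> current = Collections.emptyIterator();
  private boolean exhausted = false;

  FlagBasedIterator(Supplier<List<T>> nextRowGroup) {
    this.nextRowGroup = nextRowGroup;
  }

  @Override
  public boolean hasNext() {
    // Pull row groups until one has records or the source reports end of file.
    while (!exhausted && !current.hasNext()) {
      List<T> group = nextRowGroup.get();
      if (group == null) {
        exhausted = true; // null means the whole file is done
      } else {
        current = group.iterator();
      }
    }
    return current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }
}
```

This design never needs the number of skipped records: termination depends only on the null return value. The trade-off is that hasNext() may have to fetch the next row group before it can answer.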

       

       

       

          People

            shangx@uber.com Xinli Shang
            shangxinli Xinli Shang