Parquet / PARQUET-2170

Empty projection returns the wrong number of rows when column index is enabled

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      Discovered in Spark: when reading an empty projection from a Parquet file with filter pushdown enabled (typically a filter followed by a count), parquet-mr returns the wrong number of rows if the column index is enabled. When the column index feature is disabled, the result is correct.

       

      This happens due to the following:

      1. ParquetFileReader::getFilteredRowCount() (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851) computes the row count from the selected row ranges when the column index is enabled.
      2. ColumnIndexFilter (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80) filters the row ranges and is passed the set of column paths, which in this case is empty.
      3. When the filter is evaluated, a column path that is not in the set yields an empty list of rows (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178). For an empty projection this is always the case.
      4. As a result, the library reports an incorrect number of records.
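
The steps above can be illustrated with a small self-contained model. This is only a sketch of the described behaviour, not the parquet-mr code: the class `EmptyProjectionSketch`, the `Page` type, and `filteredRowCount` are hypothetical names, and the predicate is assumed to be a simple `column > threshold` evaluated against per-page min/max statistics.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical model of the behaviour described in steps 1-4; not parquet-mr code.
class EmptyProjectionSketch {

    /** Hypothetical per-page statistics, standing in for one column index entry. */
    static final class Page {
        final long rowCount;
        final long min;
        final long max;

        Page(long rowCount, long min, long max) {
            this.rowCount = rowCount;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Models row-range selection for the predicate `filterColumn > threshold`.
     * Returns the number of rows in pages that may contain matching values.
     */
    static long filteredRowCount(String filterColumn, long threshold,
                                 List<Page> pages, Set<String> projectedPaths) {
        if (!projectedPaths.contains(filterColumn)) {
            // Models step 3: the filter column is not in the set of paths,
            // so an empty set of rows is returned.
            return 0L;
        }
        long rows = 0L;
        for (Page page : pages) {
            // A page can only be skipped when its max rules out any match.
            if (page.max > threshold) {
                rows += page.rowCount;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(new Page(100, 0, 99), new Page(100, 100, 199));

        // Filter column is part of the projection: the second page is selected.
        System.out.println(filteredRowCount("id", 150, pages, Collections.singleton("id"))); // prints 100

        // Empty projection (filter + count): the path set is empty, so no rows are selected.
        System.out.println(filteredRowCount("id", 150, pages, Collections.<String>emptySet())); // prints 0
    }
}
```

In this model the same file and the same predicate give 100 rows with the filter column projected and 0 rows with an empty projection, which mirrors the mismatch reported here.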

      I will provide the full repro later.

       

       

            People

              Assignee: Unassigned
              Reporter: Ivan Sadikov (ivan.sadikov)
              Votes: 0
              Watchers: 3
