  1. Parquet
  2. PARQUET-2170

Empty projection returns the wrong number of rows when column index is enabled


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      Discovered in Spark: when reading an empty projection from a Parquet file with filter pushdown enabled (typically a filter followed by a count), parquet-mr returns the wrong number of rows if the column index is enabled. When the column index feature is disabled, the result is correct.

       

      This happens due to the following:

      1. ParquetFileReader::getFilteredRowCount() (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851) selects row ranges to calculate the row count when the column index is enabled.
      2. ColumnIndexFilter (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80) filters the row ranges and is passed the set of column paths, which is empty for an empty projection.
      3. When the filter is evaluated, a column path that is not in the set yields an empty list of rows (https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178). With an empty projection this is always the case.
      4. As a result, the library reports an incorrect number of records.
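
The steps above can be illustrated with a small stand-alone model. This is only a sketch of the described behaviour, not parquet-mr code: the class `EmptyProjectionSketch`, the `Page` type, and the method `filteredRowCount` are hypothetical names made up for illustration, and the page statistics are invented sample values. It assumes a single equality predicate on one column and treats "row ranges" as a plain row count.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class EmptyProjectionSketch {

    /** Hypothetical page summary: min/max value in the page and its row count. */
    static final class Page {
        final long min;
        final long max;
        final long rowCount;

        Page(long min, long max, long rowCount) {
            this.min = min;
            this.max = max;
            this.rowCount = rowCount;
        }
    }

    /**
     * Simplified model of row-range selection for the predicate `column == value`.
     * Returns how many rows survive page-level filtering.
     */
    static long filteredRowCount(String column, long value, Set<String> paths, List<Page> pages) {
        if (!paths.contains(column)) {
            // Models step 3: a column outside the path set yields no rows.
            // With an empty projection the set is empty, so this branch is always taken.
            return 0;
        }
        long rows = 0;
        for (Page page : pages) {
            // Keep a page only if its [min, max] range can contain the value.
            if (page.min <= value && value <= page.max) {
                rows += page.rowCount;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample statistics for a column named "id".
        List<Page> idPages = Arrays.asList(new Page(0, 99, 100), new Page(100, 199, 100));

        // The filter column is part of the projection: the second page is kept.
        System.out.println(filteredRowCount("id", 150, Collections.singleton("id"), idPages)); // prints 100

        // Empty projection: the path set is empty, so no rows are reported.
        System.out.println(filteredRowCount("id", 150, Collections.<String>emptySet(), idPages)); // prints 0
    }
}
```

In this model the only difference between the two calls is the set of paths, which matches the observation that the count is wrong only when the projection is empty.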

      I will provide the full repro later.

       

       


          People

            Assignee: Unassigned
            Reporter: Ivan Sadikov (ivan.sadikov)
