PARQUET-1602

PageIndex not working as suggested?

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.11.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      I have a schema such as:

      message pages {
        required binary url (STRING);
        optional binary content (STRING);
      }
      

      Here `url` is unique and sorted. The file is written in such a way that there are ~600 pages of `content` for every 1 page of `url`.

       

      From https://github.com/apache/parquet-format/blob/master/PageIndex.md I saw:

      A single-row lookup in a rowgroup based on the sort column of that rowgroup will only read one data page per retrieved column.
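To make the quoted claim concrete, here is a minimal sketch of how a row index could be mapped to a single data page per column, assuming each page's first row index is known (the format's offset index stores a `first_row_index` per page). The class name, method name, and page layouts below are hypothetical and only illustrate the idea; this is not parquet-mr code.

```java
import java.util.Arrays;

// Toy model of an offset index: one "first row index" entry per data page.
final class OffsetIndexSketch {

    // Returns the index of the page that contains rowIndex.
    // firstRowIndexes must be ascending and start at 0.
    static int pageForRow(long[] firstRowIndexes, long rowIndex) {
        int pos = Arrays.binarySearch(firstRowIndexes, rowIndex);
        // Exact hit: the row is the first row of that page.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the row
        // lives in the page just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        long[] contentPages = {0, 10, 20, 30}; // hypothetical: 10 rows per content page
        long[] urlPages = {0};                 // hypothetical: every row in one url page
        long row = 25;
        System.out.println("content page " + pageForRow(contentPages, row));
        System.out.println("url page " + pageForRow(urlPages, row));
    }
}
```

Under these assumptions, once the row index is known, each column needs exactly one page, however many pages the column has.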

      I was expecting `ParquetReader` to find the matching row using the `FilterPredicate` on `url`, decoding only that column, and then to use the offset index to seek directly to the appropriate `content` page and decode only that page.
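The lookup I have in mind can be modelled as follows. This is a toy sketch, not parquet-mr's implementation: it assumes per-page min/max statistics for `url` and a first row index per page for both columns, and every name and value in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of page-level filtering; not parquet-mr's actual implementation.
final class PageFilterSketch {

    // Pages of the filtered column whose [min, max] statistics could contain the value.
    static List<Integer> candidatePages(String[] mins, String[] maxs, String value) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < mins.length; i++) {
            if (mins[i].compareTo(value) <= 0 && maxs[i].compareTo(value) >= 0) {
                pages.add(i);
            }
        }
        return pages;
    }

    // Inclusive row range {first, last} covered by one page.
    static long[] rowRange(long[] firstRows, long rowCount, int page) {
        long last = (page + 1 < firstRows.length ? firstRows[page + 1] : rowCount) - 1;
        return new long[] {firstRows[page], last};
    }

    // Pages of another column that overlap the inclusive row range.
    static List<Integer> overlappingPages(long[] firstRows, long rowCount, long[] range) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < firstRows.length; i++) {
            long[] r = rowRange(firstRows, rowCount, i);
            if (r[0] <= range[1] && r[1] >= range[0]) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Hypothetical layout: 40 rows, two url pages, four content pages.
        String[] urlMins = {"a.example/00", "a.example/20"};
        String[] urlMaxs = {"a.example/19", "a.example/39"};
        long[] urlFirstRows = {0, 20};
        long[] contentFirstRows = {0, 10, 20, 30};

        for (int urlPage : candidatePages(urlMins, urlMaxs, "a.example/25")) {
            long[] range = rowRange(urlFirstRows, 40, urlPage);
            System.out.println("url page " + urlPage + " -> content pages "
                + overlappingPages(contentFirstRows, 40, range));
        }
    }
}
```

In this model the min/max statistics narrow the search only to the row range of the matching `url` page, so the number of `content` pages selected depends on how many rows that `url` page spans, not on the single matching row.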

      Instead, what I'm seeing is that the reader fully reads and decodes the ~600 pages of `content` until it actually finds the url.

      Have I misunderstood something, or is there a step I need to take to make the reader consume only the necessary pages?

       

       

       

       


          People

            Assignee: Unassigned
            Reporter: Anthony Pessy (panthony)