Parquet / PARQUET-2443

Support lazy materialization of row groups in ParquetFileReader

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Motivation: ParquetFileReader#readNextRowGroup currently enumerates every column chunk in the row group eagerly, then reads all pages in each chunk. For distributed data workloads, this can cause significant memory pressure, particularly when multiple Parquet files are colocated on a single worker.

      Proposal: Add a Parquet Configuration option that enables lazy row group reading, i.e., reading only one page at a time (plus whatever header is necessary to read that page). The option could be either a boolean flag or an int value for how many pages (or page bytes) to buffer at a time.
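
      As a rough illustration of how the flag and the int value could coexist, here is a minimal sketch. It uses a plain Map in place of a real Parquet or Hadoop Configuration, and both property keys are hypothetical names invented for this example; nothing here is an existing Parquet option.

```java
import java.util.HashMap;
import java.util.Map;

public class LazyReadOptionSketch {
    // Hypothetical keys, made up for this sketch; they are not existing Parquet options.
    static final String LAZY_ENABLED = "parquet.read.lazy.enabled";
    static final String LAZY_MAX_PAGES = "parquet.read.lazy.max.buffered.pages";

    /** Sentinel meaning "buffer the whole column chunk", i.e. eager reading. */
    static final int UNBOUNDED = Integer.MAX_VALUE;

    /**
     * Resolves how many pages a reader may buffer at once.
     * An explicit page count wins over the boolean flag; with neither set, reading stays eager.
     */
    static int maxBufferedPages(Map<String, String> conf) {
        String limit = conf.get(LAZY_MAX_PAGES);
        if (limit != null) {
            int pages = Integer.parseInt(limit.trim());
            if (pages < 1) {
                throw new IllegalArgumentException(LAZY_MAX_PAGES + " must be >= 1, got " + pages);
            }
            return pages;
        }
        // Boolean.parseBoolean(null) is false, so a missing flag means eager reading.
        return Boolean.parseBoolean(conf.get(LAZY_ENABLED)) ? 1 : UNBOUNDED;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(maxBufferedPages(conf) == UNBOUNDED); // eager by default

        conf.put(LAZY_ENABLED, "true");
        System.out.println(maxBufferedPages(conf)); // flag only: one page at a time

        conf.put(LAZY_MAX_PAGES, "8");
        System.out.println(maxBufferedPages(conf)); // explicit limit wins over the flag
    }
}
```

      Keeping eager reading as the default would leave existing jobs unaffected unless they opt in.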

      I think this could be accomplished by modifying ParquetFileReader#readAllPages to re-implement pagesInChunk as an Iterator<DataPage>, rather than a List<DataPage>. Then, ColumnChunkPageReader could parse the Configuration option above and decide whether to fully materialize the iterator or not.
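
      To make the Iterator-versus-List idea concrete, here is a self-contained sketch that does not use any Parquet classes. Page, PageIterator, materialize, and open are hypothetical stand-ins for DataPage, pagesInChunk, and the decision ColumnChunkPageReader would make; a real implementation would read and decode page bytes from the file where this one allocates an empty array.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LazyPagesSketch {
    /** Hypothetical stand-in for Parquet's DataPage: just an index and a payload. */
    static final class Page {
        final int index;
        final byte[] payload;

        Page(int index, byte[] payload) {
            this.index = index;
            this.payload = payload;
        }
    }

    /** Produces the pages of one column chunk on demand and tracks how many were read. */
    static final class PageIterator implements Iterator<Page> {
        private final int pageCount;
        private final int pageSizeBytes;
        private int next = 0;

        PageIterator(int pageCount, int pageSizeBytes) {
            this.pageCount = pageCount;
            this.pageSizeBytes = pageSizeBytes;
        }

        @Override
        public boolean hasNext() {
            return next < pageCount;
        }

        @Override
        public Page next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            // A real reader would read and decode this page's bytes from the file here.
            return new Page(next++, new byte[pageSizeBytes]);
        }

        /** Number of pages produced so far. */
        int pagesRead() {
            return next;
        }
    }

    /** Eager mode: drain the iterator into a list, mirroring the List-based behavior. */
    static List<Page> materialize(Iterator<Page> pages) {
        List<Page> all = new ArrayList<>();
        pages.forEachRemaining(all::add);
        return all;
    }

    /** The consumer decides, from a (hypothetical) lazy flag, whether to materialize. */
    static Iterator<Page> open(PageIterator pages, boolean lazy) {
        return lazy ? pages : materialize(pages).iterator();
    }

    public static void main(String[] args) {
        PageIterator lazySource = new PageIterator(4, 1024);
        open(lazySource, true).next();
        System.out.println(lazySource.pagesRead()); // 1: only the requested page was read

        PageIterator eagerSource = new PageIterator(4, 1024);
        open(eagerSource, false).next();
        System.out.println(eagerSource.pagesRead()); // 4: the whole chunk was read up front
    }
}
```

      In the lazy case at most one page payload is live at a time, which is the memory saving the proposal is after; the eager path keeps the existing behavior.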

      I'm happy to try to create a draft/branch for this to get some early feedback on the idea!

            People

              Assignee: Unassigned
              Reporter: Claire McGinty (clairemcginty)
              Votes: 0
              Watchers: 4
