Spark / SPARK-35743

Improve Parquet vectorized reader


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      This umbrella JIRA tracks efforts to improve the vectorized Parquet reader.

        Sub-Tasks

        1. Refactor Parquet vectorized reader to remove duplicated code paths (Sub-task; Resolved; Chao Sun)
        2. Introduce ParquetReadState to track various states while reading a Parquet column chunk (Sub-task; Resolved; Chao Sun)
        3. Enable vectorized read for VectorizedPlainValuesReader.readBooleans (Sub-task; Resolved; Kazuyuki Tanimura)
        4. Combine readBatch and readIntegers in VectorizedRleValuesReader (Sub-task; Resolved; Chao Sun)
        5. Parquet vectorized reader doesn't skip null values correctly (Sub-task; Resolved; Chao Sun)
        6. Refactor ParquetColumnIndexSuite (Sub-task; Resolved; Chao Sun)
        7. Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13 (Sub-task; Open; Unassigned)
        8. Implement lazy materialization for the vectorized Parquet reader (Sub-task; Open; Unassigned)
        9. Implement lazy decoding for the vectorized Parquet reader (Sub-task; Open; Unassigned)
        10. Decouple CPU with IO work in vectorized Parquet reader (Sub-task; Open; Unassigned)
        11. Support Parquet v2 data page encodings for the vectorized path (Sub-task; Resolved; Parth Chandra)
        12. Enhance ParquetSchemaConverter to capture Parquet repetition & definition level (Sub-task; Resolved; Chao Sun)
        13. Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding (Sub-task; Resolved; Chao Sun)
        14. Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path (Sub-task; Resolved; Yang Jie)
        15. Improve WritableColumnVector to better support null struct (Sub-task; Resolved; Unassigned)
        16. Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default (Sub-task; Resolved; Chao Sun)
        17. Skipping allocating vector for repetition & definition levels when possible (Sub-task; Resolved; Chao Sun)

          People

            Assignee: Chao Sun
            Reporter: Chao Sun
