SPARK-35743

Improve Parquet vectorized reader

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      This umbrella JIRA tracks efforts to improve the vectorized Parquet reader.

        Sub-Tasks

          1. Refactor Parquet vectorized reader to remove duplicated code paths (Sub-task, Resolved, Chao Sun)
          2. Introduce ParquetReadState to track various states while reading a Parquet column chunk (Sub-task, Resolved, Chao Sun)
          3. Enable vectorized read for VectorizedPlainValuesReader.readBooleans (Sub-task, Resolved, Kazuyuki Tanimura)
          4. Combine readBatch and readIntegers in VectorizedRleValuesReader (Sub-task, Resolved, Chao Sun)
          5. Parquet vectorized reader doesn't skip null values correctly (Sub-task, Resolved, Chao Sun)
          6. Refactor ParquetColumnIndexSuite (Sub-task, Resolved, Chao Sun)
          7. Remove ColumnIO once PARQUET-2050 is released in Parquet 1.13 (Sub-task, Resolved, BingKun Pan)
          8. Implement lazy materialization for the vectorized Parquet reader (Sub-task, Open, Unassigned)
          9. Implement lazy decoding for the vectorized Parquet reader (Sub-task, Open, Unassigned)
          10. Decouple CPU with IO work in vectorized Parquet reader (Sub-task, Open, Unassigned)
          11. Support Parquet v2 data page encodings for the vectorized path (Sub-task, Resolved, Parth Chandra)
          12. Enhance ParquetSchemaConverter to capture Parquet repetition & definition level (Sub-task, Resolved, Chao Sun)
          13. Refactor SpecificParquetRecordReaderBase and add more coverage on vectorized Parquet decoding (Sub-task, Resolved, Chao Sun)
          14. Support Parquet v2 data page RLE encoding (for Boolean Values) for the vectorized path (Sub-task, Resolved, Yang Jie)
          15. Improve WritableColumnVector to better support null struct (Sub-task, Resolved, Unassigned)
          16. Enable spark.sql.parquet.enableNestedColumnVectorizedReader on master branch by default (Sub-task, Resolved, Chao Sun)
          17. Skipping allocating vector for repetition & definition levels when possible (Sub-task, Resolved, Chao Sun)

            People

              Assignee: Chao Sun (csun)
              Reporter: Chao Sun (csun)
              Votes: 0
              Watchers: 11
