Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-34859

Vectorized parquet reader needs synchronization among pages for column index

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 3.2.0
    • 3.2.0
    • SQL

    Description

      the current implementation has a problem. the pages returned by `readNextFilteredRowGroup` may not be aligned, some columns may have more rows than others.

      Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned. 

      Currently `VectorizedParquetRecordReader` doesn't have such synchronizing among pages from different columns. Using `readNextFilteredRowGroup` may result in incorrect result.

       
      I have attache an example parquet file. This file is generated with `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this file is listed below.
      row group 0
      --------------------------------------------------------------------------------
      _1:  INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
      _2:  INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 ENC:PLAIN,BIT_PACKED [more]...

          _1 TV=2000 RL=0 DL=0
          ----------------------------------------------------------------------------
          page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:500
          page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:500
          page 2:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:500
          page 3:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:500

          _2 TV=2000 RL=0 DL=0
          ----------------------------------------------------------------------------
          page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:1000
          page 1:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for  [more]... VC:1000
       
      As you can see in the row group 0, column1 has 4 data pages each with 500 values and column 2 has 2 data pages with 1000 values each. 
      If we want to filter the rows by values with _1 = 510 using columnindex, parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 of column _1 starts with row 500, and page 0 of column _2 starts with row 0, and it will be incorrect if we simply read the two values as one row.
       
      As an example, If you try filter with  _1 = 510 with column index on in current version, it will give you the wrong result
      ----+

      _1 _2

      ----+

      510 10

      ----+
      And if turn columnindex off, you can get the correct result
      ----+

      _1 _2

      ----+

      510 510

      ----+
       

      Attachments

        Issue Links

          Activity

            People

              csun Chao Sun
              lxian2 Li Xian
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: