Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 3.2.0
Description
The current implementation has a problem: the pages returned by `readNextFilteredRowGroup` may not be aligned, so some columns may have more rows than others.
Parquet uses `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned.
Currently `VectorizedParquetRecordReader` has no such synchronization among pages from different columns, so using `readNextFilteredRowGroup` may produce incorrect results.
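A minimal sketch of what such row-index synchronization looks like, using a hypothetical page model (the `Page` class and `readAtRow` helper below are illustrative, not the actual Parquet API): each page records the row index of its first value, and the reader positions every column at the same absolute row before assembling a row.

```java
import java.util.List;

public class RowSync {
    // Hypothetical page model: a run of values plus the absolute row
    // index of the first value in the page.
    static final class Page {
        final long firstRowIndex;
        final long[] values;
        Page(long firstRowIndex, long[] values) {
            this.firstRowIndex = firstRowIndex;
            this.values = values;
        }
    }

    // Read the value at an absolute row index from a column's pages,
    // skipping to the right position the way a synchronizing reader would.
    static long readAtRow(List<Page> pages, long rowIndex) {
        for (Page p : pages) {
            long last = p.firstRowIndex + p.values.length - 1;
            if (rowIndex >= p.firstRowIndex && rowIndex <= last) {
                return p.values[(int) (rowIndex - p.firstRowIndex)];
            }
        }
        throw new IllegalArgumentException("row " + rowIndex + " not in any page");
    }
}
```

Because every column is addressed by the same absolute row index, it does not matter that different columns split the same rows into pages of different sizes.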
I have attached an example parquet file. This file was generated with `spark.range(0, 2000).map(i => (i.toLong, i.toInt))`, and its layout is listed below.
row group 0
--------------------------------------------------------------------------------
_1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
_2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
_1 TV=2000 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
_2 TV=2000 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000
page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000
As you can see in row group 0, column `_1` has 4 data pages with 500 values each, and column `_2` has 2 data pages with 1000 values each.
If we filter the rows with `_1 = 510` using the column index, Parquet returns page 1 of column `_1` and page 0 of column `_2`. Page 1 of `_1` starts at row 500, while page 0 of `_2` starts at row 0, so simply reading the two values at the same position within their pages as one row is incorrect.
For example, if you filter with `_1 = 510` with the column index enabled in the current version, you get the wrong result:
+---+---+
| _1| _2|
+---+---+
|510| 10|
+---+---+
And if you turn the column index off, you get the correct result:
+---+---+
| _1| _2|
+---+---+
|510|510|
+---+---+
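The misalignment above can be reproduced with a small model of this file's layout (the arrays and offsets below are illustrative stand-ins for the selected pages, not the actual reader). With `_1 = 510`, the column index selects page 1 of `_1` (first row index 500) and page 0 of `_2` (first row index 0); reading both pages at the same within-page offset produces the wrong pairing, while skipping to the same absolute row index produces the right one.

```java
public class Misalignment {
    public static void main(String[] args) {
        // Page 1 of _1 covers rows 500..999 (values equal the row index here,
        // since the file was generated from spark.range).
        long[] c1Page1 = new long[500];
        for (int i = 0; i < 500; i++) c1Page1[i] = 500 + i;
        // Page 0 of _2 covers rows 0..999.
        long[] c2Page0 = new long[1000];
        for (int i = 0; i < 1000; i++) c2Page0[i] = i;
        long c1First = 500, c2First = 0;  // first row index of each selected page

        long targetRow = 510;  // the row matching _1 = 510

        // Naive positional read: the same within-page offset in both pages.
        int offset = (int) (targetRow - c1First);          // offset 10
        System.out.println("naive:  _1=" + c1Page1[offset]
                + " _2=" + c2Page0[offset]);                // _1=510 _2=10 (wrong)

        // Row-index-synchronized read: skip each page to the same absolute row.
        System.out.println("synced: _1=" + c1Page1[(int) (targetRow - c1First)]
                + " _2=" + c2Page0[(int) (targetRow - c2First)]); // _1=510 _2=510
    }
}
```

The only difference between the two reads is whether the skip is computed per column from that column's own first row index, which is exactly the synchronization the vectorized reader is missing.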
Attachments
Issue Links
- causes SPARK-36123 Parquet vectorized reader doesn't skip null values correctly (Resolved)
- is a child of SPARK-35743 Improve Parquet vectorized reader (Resolved)
- is related to SPARK-26345 Parquet support Column indexes (Resolved)