[SPARK-34859] Vectorized parquet reader needs synchronization among pages for column index - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.2.0
Component/s: SQL
Labels:
- correctness

Target Version/s:

3.2.0

Description

the current implementation has a problem. the pages returned by `readNextFilteredRowGroup` may not be aligned, some columns may have more rows than others.

Parquet is using `org.apache.parquet.column.impl.SynchronizingColumnReader` with `rowIndexes` to make sure that rows are aligned.

Currently `VectorizedParquetRecordReader` doesn't have such synchronizing among pages from different columns. Using `readNextFilteredRowGroup` may result in incorrect result.

I have attache an example parquet file. This file is generated with `spark.range(0, 2000).map(i => (i.toLong, i.toInt))` and the layout of this file is listed below.
row group 0
--------------------------------------------------------------------------------
_1: INT64 SNAPPY DO:0 FPO:4 SZ:8161/16104/1.97 VC:2000 ENC:PLAIN,BIT_PACKED [more]...
_2: INT32 SNAPPY DO:0 FPO:8165 SZ:8061/8052/1.00 VC:2000 ENC:PLAIN,BIT_PACKED [more]...

_1 TV=2000 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 2: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500
page 3: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:500

_2 TV=2000 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000
page 1: DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[no stats for [more]... VC:1000

As you can see in the row group 0, column1 has 4 data pages each with 500 values and column 2 has 2 data pages with 1000 values each.
If we want to filter the rows by values with _1 = 510 using columnindex, parquet will return the page 1 of column _1 and page 0 of column _2. Page 1 of column _1 starts with row 500, and page 0 of column _2 starts with row 0, and it will be incorrect if we simply read the two values as one row.

As an example, If you try filter with _1 = 510 with column index on in current version, it will give you the wrong result
----+

_1

_2

----+

510

10

----+
And if turn columnindex off, you can get the correct result
----+

_1

_2

----+

510

----+

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

part-00000-bee08cae-04cd-491c-9602-4c66791af3d0-c000.snappy.parquet
26/May/21 03:21
17 kB
Li Xian

Issue Links

causes

SPARK-36123 Parquet vectorized reader doesn't skip null values correctly

Resolved

is a child of

SPARK-35743 Improve Parquet vectorized reader

Resolved

is related to

SPARK-26345 Parquet support Column indexes

Resolved

links to

[Github] Pull Request #31998 (lxian)

[Github] Pull Request #32753 (sunchao)

(2 links to)

Vectorized parquet reader needs synchronization among pages for column index

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates