Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: 1.13.0
Labels: None
Description
Spark currently uses parquet-mr as its Parquet reader/writer library, but the library's built-in bit-packing encode/decode is not efficient enough.
Our optimization of the Parquet bit-packing encode/decode path with jdk.incubator.vector in OpenJDK 18 brings a prominent performance improvement.
Because the Vector API has been part of OpenJDK since JDK 16 (as an incubator module), this optimization requires JDK 16 or higher.
Below are our test results.

The functional test is based on the open-source parquet-mr bit-pack decoding function:

public final void unpack8Values(final byte[] in, final int inPos, final int[] out, final int outPos)

compared with our Vector API implementation:

public final void unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final int outPos)

We tested 10 pairs of decode functions (open-source Parquet bit unpacking vs. our optimized vectorized SIMD implementation) with bit width = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; the test results are below:
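For reference, here is a minimal scalar sketch of what a Parquet-style unpack8Values computes (illustrative only, not the parquet-mr implementation), assuming LSB-first bit order; the pack8Values helper exists only to make the round trip self-contained:

```java
// Minimal scalar sketch of Parquet-style bit unpacking (illustrative only,
// not the parquet-mr implementation). Values are packed LSB-first.
public class BitUnpackSketch {
    // Unpack 8 values of bitWidth bits each from in[inPos..] into out[outPos..].
    static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
        long buffer = 0;  // bit accumulator
        int bits = 0;     // number of valid bits in the accumulator
        int pos = inPos;
        int mask = (1 << bitWidth) - 1;
        for (int i = 0; i < 8; i++) {
            while (bits < bitWidth) {  // refill one byte at a time
                buffer |= (long) (in[pos++] & 0xFF) << bits;
                bits += 8;
            }
            out[outPos + i] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bits -= bitWidth;
        }
    }

    // Pack 8 values, the inverse of the above; 8 * bitWidth bits flush exactly.
    static void pack8Values(int[] in, int inPos, byte[] out, int outPos, int bitWidth) {
        long buffer = 0;
        int bits = 0;
        int pos = outPos;
        for (int i = 0; i < 8; i++) {
            buffer |= ((long) in[inPos + i] & ((1L << bitWidth) - 1)) << bits;
            bits += bitWidth;
            while (bits >= 8) {  // flush full bytes
                out[pos++] = (byte) buffer;
                buffer >>>= 8;
                bits -= 8;
            }
        }
    }

    public static void main(String[] args) {
        int[] values = {3, 127, 0, 64, 5, 99, 1, 126};  // all fit in 7 bits
        byte[] packed = new byte[7];                    // 8 * 7 bits = 7 bytes
        pack8Values(values, 0, packed, 0, 7);
        int[] decoded = new int[8];
        unpack8Values(packed, 0, decoded, 0, 7);
        System.out.println(java.util.Arrays.equals(values, decoded)); // true
    }
}
```

The inner per-value loop is what makes the scalar path slow; the vectorized version replaces it with shuffles and shifts over whole SIMD registers.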
We integrated our bit-packing decode implementation into parquet-mr and tested the Parquet batch-read path through Spark's VectorizedParquetRecordReader, which fetches Parquet column data in batches. We constructed Parquet files with varying row and column counts: the column data type is Int32, the maximum int value is 127 (which fits bit-pack encoding with bit width = 7), the row count ranges from 10k to 100 million, and the column count ranges from 1 to 4.
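The bit-width choice in that setup can be verified with a few lines of arithmetic (illustrative only; the constants mirror the test description above):

```java
// Illustrative arithmetic for the test setup above: a max value of 127 needs
// 7 bits, so bit packing stores 100 million Int32 values in ~87.5 MB instead
// of the 400 MB an unpacked int[] would take.
public class PackedSizeCheck {
    public static void main(String[] args) {
        int maxValue = 127;
        int bitWidth = 32 - Integer.numberOfLeadingZeros(maxValue);
        long rows = 100_000_000L;
        long packedBytes = (rows * bitWidth + 7) / 8;  // round up to whole bytes
        long rawBytes = rows * Integer.BYTES;
        System.out.println(bitWidth);     // 7
        System.out.println(packedBytes);  // 87500000
        System.out.println(rawBytes);     // 400000000
    }
}
```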
Attachments
Issue Links
is a parent of:
PARQUET-2375 Extend vectorized bit unpacking benchmark for various bit sizes (Resolved)
1. Parquet java vector decode optimization for Big Endian (Open, Unassigned)
2. Parquet java vector decode optimization Long for Big Endian (Open, Unassigned)
3. Parquet java vector decode optimization Long for Little Endian (Open, Unassigned)