[PARQUET-2149] Implement async IO for Parquet file reader - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: parquet-mr
Labels: None

Description

ParquetFileReader's implementation has the following flow (simplified):
- For every column -> read from storage in 8MB blocks -> read all pages (not yet decompressed) into an output queue
- From the output queues -> (downstream) decompression + decoding

This flow is serialized, which means that downstream threads are blocked until the data has been read. Because much of the time is spent waiting for data from storage, threads sit idle and CPU utilization is low.

There is no reason why this cannot be made asynchronous and parallel. The proposed flow is:

For column i -> read one chunk at a time from storage until the end -> intermediate output queue -> read one page (not yet decompressed) at a time until the end -> output queue -> (downstream) decompression + decoding
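The per-column pipeline described above could be sketched roughly as follows. This is only an illustration of the queueing idea, not code from parquet-mr: the class name AsyncColumnPipelineSketch, the methods startColumn and drain, the in-memory "storage" list, and the fixed page size are all hypothetical stand-ins. Real pages are variable-sized and are delimited by page headers, and a real implementation would likely use bounded queues for backpressure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncColumnPipelineSketch {
    // Sentinel marking the end of a stream; compared by identity, never by content.
    private static final byte[] END = new byte[0];

    private static void put(BlockingQueue<byte[]> queue, byte[] item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static byte[] take(BlockingQueue<byte[]> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Starts both asynchronous stages for one column and returns its page queue. */
    static BlockingQueue<byte[]> startColumn(List<byte[]> storageChunks, int pageSize, ExecutorService pool) {
        BlockingQueue<byte[]> chunkQueue = new LinkedBlockingQueue<>(); // intermediate output queue
        BlockingQueue<byte[]> pageQueue = new LinkedBlockingQueue<>();  // output queue

        // Stage 1: "read" chunks from storage (an in-memory list here) until the end.
        CompletableFuture.runAsync(() -> {
            try {
                for (byte[] chunk : storageChunks) {
                    put(chunkQueue, chunk);
                }
            } finally {
                put(chunkQueue, END); // always signal the end so stage 2 cannot hang
            }
        }, pool);

        // Stage 2: cut each chunk into pages (fixed-size slices, as an assumption).
        CompletableFuture.runAsync(() -> {
            try {
                for (byte[] chunk = take(chunkQueue); chunk != END; chunk = take(chunkQueue)) {
                    for (int off = 0; off < chunk.length; off += pageSize) {
                        int end = Math.min(off + pageSize, chunk.length);
                        put(pageQueue, Arrays.copyOfRange(chunk, off, end));
                    }
                }
            } finally {
                put(pageQueue, END);
            }
        }, pool);

        return pageQueue;
    }

    /** Downstream consumer: decompression + decoding would happen here; this only counts pages. */
    static int drain(BlockingQueue<byte[]> pageQueue) {
        int pages = 0;
        for (byte[] page = take(pageQueue); page != END; page = take(pageQueue)) {
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Two columns are started before either is drained, so their reads overlap.
            BlockingQueue<byte[]> col0 = startColumn(List.of(new byte[10], new byte[5]), 4, pool);
            BlockingQueue<byte[]> col1 = startColumn(List.of(new byte[8]), 4, pool);
            System.out.println(drain(col0) + " " + drain(col1)); // prints "5 2"
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every column gets its own pair of queues, the downstream consumer can start decoding the first pages of one column while storage reads for the other columns are still in flight.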

Note that this can be made completely self-contained in ParquetFileReader, so downstream implementations like Iceberg and Spark will automatically be able to take advantage of it without code changes, as long as the ParquetFileReader APIs are not changed.

In past work with async IO in Drill (the async page reader), I have seen a 2x-3x improvement in reading speed for Parquet files.

Issue Links

is depended upon by

PARQUET-2486 Improve Parquet IO Performance within cloud datalakes (In Progress)

links to

GitHub Pull Request #968

People

Assignee: Unassigned

Reporter: Parth Chandra

Votes: 3

Watchers: 13

Dates

Created: 16/May/22 18:26

Updated: 23/Jun/24 03:32