[ARROW-10995] [Rust] [DataFusion] Improve parallelism when reading Parquet files - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0
Component/s: Rust - DataFusion
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/26916

Description

Currently the unit of parallelism is the Parquet file: one async task is run per file being read.

For example, if we run a query against a Parquet table that consists of 8 partitions (files), we attempt to run 8 async tasks in parallel. If there is a single Parquet file, we run only 1 async task, so this does not scale well. Conversely, if there are hundreds or thousands of Parquet files, we try to process them all concurrently, which also does not scale well.

These are the options for improving this situation:

1. Use Parquet row groups as the unit of partitioning and divide the row groups evenly across the desired level of concurrency (defaulting to the number of cores).
2. Keep the file as the unit of partitioning. If there are fewer partitions (files) than cores, add a RepartitionExec to the plan. If there are more files than cores, split the files into lists so that each partition is a list of files rather than a single file. Each partition task then processes one file at a time.
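As a rough illustration of the two options above, the following sketch distributes a list of work units (row groups for the first option, whole files for the second) across a target number of partitions. It is not the DataFusion implementation: `RowGroupRef`, `row_group_units`, `split_into_partitions`, and `default_concurrency` are hypothetical names introduced here, the row-group counts are assumed to be already known from file metadata, and the RepartitionExec step for the "fewer files than cores" case is not modeled.

```rust
use std::thread;

/// Hypothetical handle for one row group inside one Parquet file.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupRef {
    file: String,
    row_group: usize,
}

/// Assumed default level of concurrency: the number of available cores,
/// falling back to 1 if it cannot be determined.
fn default_concurrency() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Expand (file name, row group count) pairs into one unit per row group.
/// The counts are assumed to come from the Parquet file metadata.
fn row_group_units(files: &[(String, usize)]) -> Vec<RowGroupRef> {
    files
        .iter()
        .flat_map(|(file, count)| {
            (0..*count).map(move |row_group| RowGroupRef {
                file: file.clone(),
                row_group,
            })
        })
        .collect()
}

/// Split `items` into contiguous partitions of near-equal size.
/// Produces min(target_partitions, items.len()) partitions, so a
/// partition is never empty.
fn split_into_partitions<T: Clone>(items: &[T], target_partitions: usize) -> Vec<Vec<T>> {
    if items.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    let n = target_partitions.min(items.len());
    let base = items.len() / n; // minimum items per partition
    let extra = items.len() % n; // the first `extra` partitions get one more
    let mut partitions = Vec::with_capacity(n);
    let mut start = 0;
    for i in 0..n {
        let len = base + if i < extra { 1 } else { 0 };
        partitions.push(items[start..start + len].to_vec());
        start += len;
    }
    partitions
}
```

For the first option, `split_into_partitions(&row_group_units(&files), default_concurrency())` would give each task a list of row groups, so even a single large file can be read by several tasks. For the second option, passing the list of file names instead gives each task a list of files to process one at a time, which caps the number of concurrent tasks when there are many files.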

Issue Links

is blocked by: ARROW-11016 [Rust] Parquet ArrayReader should allow reading a subset of row groups (Closed)
links to: GitHub Pull Request #9029

People

Assignee: Andy Grove
Reporter: Andy Grove
Votes: 0
Watchers: 3

Dates

Created: 20/Dec/20 23:07
Updated: 11/Jan/23 08:16
Resolved: 29/Dec/20 16:30

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2h 20m