Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
A common concern with the Parquet crate's Arrow readers and writers is that they are slow.
My diagnosis is that we rely on a chain of processing steps, each of which introduces overhead.
For example, writing an Arrow RecordBatch involves the following (a sketch of the call that triggers this chain follows the list):
1. Iterate through the arrays to compute definition/repetition (def/rep) levels
2. Extract Parquet primitive values from the arrays using these levels
3. Write the primitive values, re-validating them in the process (even though they should already be valid)
4. Split the already materialised values into small batches for Parquet chunks (consider the case where a batch holds 1e6 values)
5. Write these batches, computing statistics for each batch and encoding the values
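For reference, a minimal sketch of the user-facing call that triggers this whole chain, assuming the parquet crate's ArrowWriter entry point (the column name, sizes, and output path are illustrative):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A trivial single-column batch; "v" is an illustrative name.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let values = Int32Array::from((0..1_000_000).collect::<Vec<i32>>());
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)])?;

    // This single write() call walks through all five steps above:
    // level computation, value extraction, re-validation, splitting
    // into Parquet-sized chunks, and stats computation + encoding.
    let file = File::create("bench.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```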
The above is a side-effect of convenience: bypassing some of these steps would likely require considerably more effort.
I have ideas for going from step 1 to step 5 directly, but without performance benchmarks I cannot tell whether this is actually better. I also struggle to tell whether I am making improvements while cleaning up the writer code, especially when removing the allocations that I introduced to reduce the complexity of the level calculations.
With ARROW-12120 (random array & batch generator), it becomes more convenient to benchmark (and test many combinations of) the Arrow writer.
I would thus like to start adding benchmarks for the Arrow writer.
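As a starting point, such a benchmark could look roughly like the sketch below. It assumes criterion (which the Arrow crates already use for benchmarking) and a parquet version whose ArrowWriter accepts any in-memory Write sink; the batch here is fixed, whereas the ARROW-12120 generator would let us sweep types, sizes, and null densities:

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use criterion::{criterion_group, criterion_main, Criterion};
use parquet::arrow::arrow_writer::ArrowWriter;

fn bench_write_batch(c: &mut Criterion) {
    // Fixed input for now; the ARROW-12120 generator would replace this
    // with randomised arrays covering many types and null densities.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let values = Int32Array::from((0..1_000_000).collect::<Vec<i32>>());
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)]).unwrap();

    c.bench_function("arrow_writer primitive 1e6", |b| {
        b.iter(|| {
            // Write to an in-memory buffer so we measure the writer,
            // not the file system.
            let mut buffer = Vec::new();
            let mut writer =
                ArrowWriter::try_new(&mut buffer, schema.clone(), None).unwrap();
            writer.write(&batch).unwrap();
            writer.close().unwrap();
        })
    });
}

criterion_group!(benches, bench_write_batch);
criterion_main!(benches);
```

Run with cargo bench as usual; once the generator lands, the same harness can be parameterised over nested types, string columns, and varying null densities.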