Details

Type: Improvement
Priority: Major
Status: Closed
Resolution: Invalid
Description
Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:
struct MyData { name: String, address: Option<String> }
Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system; at a minimum we have to allocate a Vec the same size as the bulk data even when we are only using references.
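For concreteness, a minimal sketch of the current pattern (assuming bulk: Vec<MyData>; conversion to the column's physical type is elided):

// Today each column must be materialized into its own Vec before
// it can be passed to write_batch as a slice:
let names: Vec<&String> = bulk.iter().map(|x| &x.name).collect();
let addresses: Vec<&Option<String>> = bulk.iter().map(|x| &x.address).collect();
let def_levels: Vec<i16> = bulk.iter().map(|x| x.address.is_some() as i16).collect();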
What I'm proposing is an IntoIterator style. This maintains backward compatibility, since a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from values: &[T::T] to values: impl IntoIterator<Item = T::T>. Then you can do things like
write_batch(bulk.iter().map(|x| x.name), None, None)
write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.address.is_some())), None)
and you can see there's no need for an intermediate Vec, so no short-lived allocations are made just to write out the data.
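As a rough sketch, the proposed signature might look something like this (hypothetical, not the crate's current API; def_levels and rep_levels are shown as iterators of i16 to match the existing slice-based parameters):

pub fn write_batch(
    &mut self,
    values: impl IntoIterator<Item = T::T>,
    def_levels: Option<impl IntoIterator<Item = i16>>,
    rep_levels: Option<impl IntoIterator<Item = i16>>,
) -> Result<usize>

One caveat: a slice &[T::T] yields &T::T when iterated, not T::T, so for slice callers to work unchanged the bound might need to accept borrowed items as well, e.g. Item: Borrow<T::T>.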
I am writing data with many columns, and I think this would really help speed things up.