Details
- Type: Wish
- Status: Closed
- Priority: Minor
- Resolution: Feedback Received
Description
Currently, when a CSV file is read, a single thread is responsible both for reading the file and for transforming the returned string values into the correct data types.
In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 seconds. Of this time, only ~10% is spent reading the file, while ~68% is spent transforming the string values into the correct data types.
My proposal is to parallelize the part responsible for the data type transformation.
It seems quite simple to achieve: after the CSV reader reads a batch, all projected columns are transformed one by one using an iterator over the vector followed by a map. I believe that with the rayon crate, the only changes needed would be adjusting "iter()" to "par_iter()" and
changing
impl<R: Read> Reader<R>
into:
impl<R: Read + std::marker::Sync> Reader<R>
But maybe I am overlooking something crucial (being quite new to Rust and Arrow). Any advice from someone experienced is therefore very welcome!
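The per-column idea can be sketched with just the standard library (`std::thread::scope`, Rust 1.63+). This is a hypothetical stand-in for the reader's internal batch, not Arrow's actual CSV reader API: each column arrives as raw strings and is parsed to `f64` on its own thread. With rayon, the same effect would come from swapping `iter()` for `par_iter()` instead of spawning threads by hand.

```rust
use std::thread;

// Hypothetical batch shape: one Vec<String> of raw CSV values per
// projected column. Parse every column in parallel, one scoped thread
// per column; unparsable values become NaN for simplicity.
fn parse_columns_parallel(columns: &[Vec<String>]) -> Vec<Vec<f64>> {
    thread::scope(|s| {
        // Collect the handles first so all threads start before any join.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| {
                s.spawn(move || {
                    col.iter()
                        .map(|v| v.parse::<f64>().unwrap_or(f64::NAN))
                        .collect::<Vec<f64>>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let cols = vec![
        vec!["1.5".to_string(), "2.5".to_string()],
        vec!["3.0".to_string(), "4.0".to_string()],
    ];
    let parsed = parse_columns_parallel(&cols);
    println!("{:?}", parsed);
}
```

Scoped threads let the workers borrow the batch directly, which mirrors why the `Reader` would need a `Sync` bound: the shared state must be safe to reference from multiple threads at once.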
Attachments
Issue Links
- is related to: ARROW-9707 [Rust] [DataFusion] Re-implement threading model (Closed)