Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Invalid
Description
This is the parent story. See subtasks for more information.
Notes from wesm:
A couple of initial things to keep in mind:
- Writes of both Nullable (OPTIONAL) and non-nullable (REQUIRED) fields
- You can optimize the special case where a nullable field's data has no nulls
- A good amount of code is required to convert the Arrow physical form of various logical types to the equivalent Parquet form; see https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc for details
- It would be worth thinking up front about how dictionary-encoded data is handled on both the Arrow write and Arrow read paths. In parquet-cpp we initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary to dense String). Through real-world need I was forced to revisit this (quite painfully) to let Arrow dictionaries survive round trips to Parquet format, and to achieve better performance and memory use in both reads and writes. You can certainly do a dictionary-to-dense conversion like we did, but you may someday find yourselves doing the same painful refactor to make dictionary write and read not only more efficient but also dictionary order preserving.
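To make the first two points concrete, here is a minimal Rust sketch of how definition levels for a flat (non-nested) column might be derived from a validity bitmap, including the fast path for a nullable column that happens to contain no nulls. The function name and the bool-slice representation of the bitmap are illustrative assumptions, not the actual parquet crate API.

```rust
/// Hypothetical helper (not the real parquet crate API): derive Parquet
/// definition levels for a flat, non-nested column from an Arrow-style
/// validity bitmap, represented here as a plain bool slice for simplicity.
fn definition_levels(nullable: bool, validity: Option<&[bool]>, len: usize) -> Option<Vec<i16>> {
    if !nullable {
        // REQUIRED field: definition levels are never written.
        return None;
    }
    match validity {
        // Fast path: a nullable column with no validity bitmap has no nulls,
        // so every definition level is 1.
        None => Some(vec![1; len]),
        // General path: level 1 where the slot is valid, 0 where it is null.
        Some(bits) => Some(bits.iter().map(|&v| i16::from(v)).collect()),
    }
}
```

A real implementation would read the packed validity bits directly and handle nested types, where definition levels range beyond 0 and 1.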
Notes from sunchao:
I roughly skimmed through the C++ implementation; at a high level I think we need to do the following:
- implement a method similar to WriteArrow in column_writer.cc. We can break this up further into smaller pieces: dictionary vs. non-dictionary, primitive types, booleans, timestamps, dates, and so on.
- implement an Arrow writer in the parquet crate, offering APIs similar to those in writer.h.
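The decomposition suggested above (a WriteArrow-style entry point dispatching to type-specific helpers) could be sketched in Rust as follows. All names here, ArrowColumn, ArrowColumnWriter, write_arrow, are hypothetical stand-ins for illustration, not the actual arrow or parquet crate types, and the helpers only count rows where a real writer would encode pages.

```rust
/// Hypothetical stand-in for an Arrow column; the real crate would use
/// ArrayRef and downcasting rather than an enum.
#[allow(dead_code)]
enum ArrowColumn {
    Int32 { values: Vec<i32>, validity: Option<Vec<bool>> },
    Boolean { values: Vec<bool>, validity: Option<Vec<bool>> },
}

/// Hypothetical column writer tracking how many rows have been written.
struct ArrowColumnWriter {
    rows_written: usize,
}

impl ArrowColumnWriter {
    fn new() -> Self {
        Self { rows_written: 0 }
    }

    /// Analogue of the C++ WriteArrow entry point: dispatch on the Arrow
    /// type and delegate to a type-specific helper, mirroring the suggested
    /// split into dictionary/non-dictionary, primitives, booleans, etc.
    fn write_arrow(&mut self, col: &ArrowColumn) -> usize {
        let n = match col {
            ArrowColumn::Int32 { values, .. } => self.write_i32(values),
            ArrowColumn::Boolean { values, .. } => self.write_bool(values),
        };
        self.rows_written += n;
        n
    }

    // Placeholder helpers: a real writer would compute levels and encode
    // values into Parquet pages here.
    fn write_i32(&mut self, values: &[i32]) -> usize {
        values.len()
    }

    fn write_bool(&mut self, values: &[bool]) -> usize {
        values.len()
    }
}
```

Keeping the dispatch in one place while the per-type encoding lives in small helpers makes it straightforward to add types (timestamps, dates, dictionaries) incrementally, as the note suggests.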