Apache Arrow / ARROW-16119 [Python] Deprecate the legacy ParquetDataset custom python-based implementation

ARROW-16240
[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False



    Description

      The legacy implementation of pq.write_to_dataset supports the row_group_size/chunk_size keyword for specifying the row group size of the written Parquet files.

      Now that use_legacy_dataset=False is the default, this keyword no longer works.

      This is because dataset.write_dataset(..) doesn't support the Parquet row_group_size keyword: the ParquetFileWriteOptions class does not expose it.

      On the Parquet side, this is also the only keyword that is passed not to the ParquetWriter constructor (and thus to Parquet's WriterProperties or ArrowWriterProperties) but to the actual write_table call. In C++ this can be seen at https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71

      See discussion: https://github.com/apache/arrow/pull/12811#discussion_r845304218


            People

              Assignee: Alenka Frim
              Reporter: Alenka Frim
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m