[ARROW-3020] [Python] Addition of option to allow empty Parquet row groups - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: C++, Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/19381

Description

While our use case is not common, I was able to find one related request from roughly a year ago. Could this be added as a feature?

https://issues.apache.org/jira/browse/PARQUET-1047

Motivation

We have an application in which each row is associated with one of N contexts, although a minority of contexts may have no associated rows. When we encounter the nth context, we want to retrieve all of its associated rows. Row groups would provide a natural way to index the data, since the nth context could map directly to the nth row group.

Unfortunately, this is not currently possible, because pyarrow does not support writing empty row groups. If one writes a pyarrow.Table containing zero rows using pyarrow.parquet.ParquetWriter, the table is omitted from the final file, so every subsequent row group shifts to a lower index and the context-to-row-group mapping breaks.

Issue Links

links to

GitHub Pull Request #3269

People

Assignee: Wes McKinney

Reporter: Alex Mendelson

Votes: 0

Watchers: 6

Dates

Created: 08/Aug/18 07:35

Updated: 11/Jan/23 07:24

Resolved: 28/Dec/18 14:57

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 20m