[ARROW-17200] [Python][Parquet] support partitioning by Pandas DataFrame index - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Invalid
Affects Version/s: None
Fix Version/s: None
Component/s: Parquet, Python
Labels: None

External issue URL: https://github.com/pandas-dev/pandas/issues/47797

Description

In a pandas DataFrame with a MultiIndex whose "outer" level varies slowly, one might want to partition by that index level when saving the DataFrame to Parquet. This is currently not possible: you have to reset the index manually before writing and restore it after reading. It would be very useful to be able to pass the name of an index level to partition_cols instead of (or, ideally, in addition to) a data column name.

I originally posted this on the Pandas issue tracker (https://github.com/pandas-dev/pandas/issues/47797). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.

People

Assignee: Unassigned

Reporter: Gregory Werbin

Votes: 0

Watchers: 4

Dates

Created: 25/Jul/22 20:49

Updated: 11/Jan/23 11:49

Resolved: 20/Oct/22 12:20