Details
Type: New Feature
Status: Closed
Priority: Minor
Resolution: Invalid
Description
Consider a pandas DataFrame with a MultiIndex whose "outer" index level varies slowly. When saving such a DataFrame to Parquet format, it is natural to want to partition by that index level. This is currently not possible: you have to reset the index manually before writing and re-add it after reading. It would be very useful if partition_cols accepted the name of an index level instead of (or, ideally, in addition to) a data column name.
I originally posted this on the Pandas issue tracker (https://github.com/pandas-dev/pandas/issues/47797). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.