Apache Arrow / ARROW-17200

[Python][Parquet] support partitioning by Pandas DataFrame index


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      In a Pandas DataFrame with a multi-index whose "outer" level varies slowly, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible: you have to reset the index manually before writing and restore it after reading. It would be very useful to be able to pass the name of an index level to partition_cols instead of (or, ideally, in addition to) a data column name.

      I originally posted this on the Pandas issue tracker (https://github.com/pandas-dev/pandas/issues/47797). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.


          People

            Assignee: Unassigned
            Reporter: Gregory Werbin (gwerbin)
            Votes: 0
            Watchers: 4
