Apache Arrow
ARROW-8655

[C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset



    Description

      Currently, we have the HivePartitioning and DirectoryPartitioning classes that describe the partitioning used in the discovery phase. But once a dataset object is created, it no longer knows anything about this; it only has partition expressions for the fragments. The partition keys are added to the schema, but you cannot directly tell which columns of the schema originated from the partitions.
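      As an illustration of the current behaviour, a minimal sketch with pyarrow (the directory layout and the "year"/"month" partition fields are assumptions for the example):

          import pyarrow.dataset as ds

          # Discover a hive-partitioned dataset, e.g. /data/year=2020/month=5/part-0.parquet
          dataset = ds.dataset("/data", format="parquet", partitioning="hive")

          # The partition keys ("year", "month") show up as regular columns in the schema ...
          print(dataset.schema)

          # ... but the dataset itself no longer records which columns came from the
          # partitioning; only the per-fragment partition expressions remain:
          for fragment in dataset.get_fragments():
              print(fragment.partition_expression)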

      However, there are use cases where it would be useful for a dataset to still "know" what kind of partitioning it was created from:

      • The "read CSV write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for parquet (kind of roundtripping the partitioning on read/write)
      • To convert the dataset to other representations, e.g. conversion to pandas, it can be useful to know which columns were partition columns (for pandas, those columns might be good candidates to be set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems can use similar information.
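      To make the first use case concrete, a rough sketch of what currently has to be done by hand (the paths and the "year" partition field are assumptions for the example); if the dataset retained its partitioning, the partitioning argument on the write could be filled in automatically:

          import pyarrow as pa
          import pyarrow.dataset as ds

          # The CSV dataset was discovered with hive-style partitioning ...
          csv_dataset = ds.dataset("/data/csv", format="csv", partitioning="hive")

          # ... but writing it back as partitioned Parquet requires re-specifying the
          # partitioning, because the dataset no longer remembers it:
          ds.write_dataset(
              csv_dataset,
              "/data/parquet",
              format="parquet",
              partitioning=ds.partitioning(pa.schema([("year", pa.int32())]), flavor="hive"),
          )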



          People

            Assignee: Joris Van den Bossche
            Reporter: Joris Van den Bossche
            Votes: 0
            Watchers: 6


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3h 40m