Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Currently, we have the HivePartitioning and DirectoryPartitioning classes that describe the partitioning used in the discovery phase. But once a dataset object is created, it no longer knows anything about this; it only has partition expressions for the fragments. And while the partition keys are added to the schema, you can't directly tell which columns of the schema originated from the partitions.
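To illustrate, a minimal sketch against the pyarrow.dataset Python API (the path and the "year" column are made-up examples):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Discover a Hive-partitioned dataset, e.g. /data/events/year=2020/part-0.parquet
# (the path and the "year" partition column are hypothetical).
part = ds.partitioning(pa.schema([("year", pa.int32())]), flavor="hive")
dataset = ds.dataset("/data/events", format="parquet", partitioning=part)

# The partition key is folded into the unified schema alongside the file columns:
print(dataset.schema)  # includes "year"

# Each fragment only carries an opaque partition expression:
for fragment in dataset.get_fragments():
    print(fragment.partition_expression)  # e.g. (year == 2020)

# Nothing on `dataset` records that HivePartitioning was used, nor which
# schema columns originated from the partitioning.
```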
However, there are use cases where it would be useful for a dataset to still "know" what kind of partitioning it was created from:
- The "read CSV, write back Parquet" use case, where the CSV dataset was already partitioned and you want to automatically preserve that partitioning for Parquet (roundtripping the partitioning on read/write; see the sketch after this list).
- Converting the dataset to another representation, e.g. pandas: it can be useful to know which columns were partition columns (for pandas, those columns might be good candidates to set as the index of the pandas/dask DataFrame; see the sketch after this list). I can imagine conversions to other systems could use similar information.
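A rough sketch of what both use cases look like today (again with hypothetical paths and column names): the partitioning and the partition column have to be re-specified by hand, because the dataset object no longer carries them:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Use case 1: read a partitioned CSV dataset, write it back as Parquet.
# The partitioning must be spelled out twice (once for discovery, once
# for writing) because the dataset does not retain it.
csv_dataset = ds.dataset("/data/events_csv", format="csv",
                         partitioning=ds.partitioning(flavor="hive"))
ds.write_dataset(csv_dataset, "/data/events_parquet", format="parquet",
                 partitioning=ds.partitioning(
                     pa.schema([("year", pa.int32())]), flavor="hive"))

# Use case 2: convert to pandas with the partition column as the index.
# The column name must be supplied manually for the same reason.
df = csv_dataset.to_table().to_pandas().set_index("year")
```

If the dataset remembered its partitioning, the write and the set_index calls above could pick it up automatically instead of requiring the user to repeat it.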