Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
When loading datasets back, you can provide just the names of the columns the data was partitioned by, using
partitioning=pyarrow.dataset.partitioning(field_names=["year"])
This is convenient because it is easier and quicker than providing the whole schema, and the column types can still be autodetected from the loaded data.
On the other hand, this is not supported when saving data. If you provide field_names instead of a schema, the write fails with a ValueError:
pyarrow/dataset.py in _ensure_write_partitioning(scheme)
    684     if not isinstance(scheme, Partitioning):
    685         # TODO support passing field names, and get types from schema
--> 686         raise ValueError("partitioning needs to be actual Partitioning object")
    687     return scheme
    688
It would be convenient to allow passing only field_names when saving as well, since the schema can be detected automatically from the table being saved.