[ARROW-13755] [Python] Allow usage of field_names in partitioning when saving datasets - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/29385

Description

When loading a dataset back, you can specify just the names of the columns the data was partitioned by:

partitioning=pyarrow.dataset.partitioning(field_names=["year"])

This is convenient because it is easier and quicker than providing the whole schema; the field types can still be autodetected from the loaded data.

On the other hand, this is not supported when saving data. If you provide field_names instead of the schema, you get an error:

pyarrow/dataset.py in _ensure_write_partitioning(scheme)
    684     if not isinstance(scheme, Partitioning):
    685         # TODO support passing field names, and get types from schema
--> 686         raise ValueError("partitioning needs to be actual Partitioning object")
    687     return scheme
    688

It would be convenient to allow field_names alone when saving too, since the schema can be detected automatically from the table being saved.

Issue Links

links to

GitHub Pull Request #11008

People

Assignee: Alessandro Molina

Reporter: Alessandro Molina

Votes: 0

Watchers: 3

Dates

Created: 25/Aug/21 14:42

Updated: 11/Jan/23 08:35

Resolved: 21/Sep/21 15:50
