Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 8.0.0
- Python 3.9.13, pyarrow 8.0.0
Description
In the code below:
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)

columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)

ds.write_dataset(
    scanner,
    'newpath',
    partitioning=['Region'],
    partitioning_flavor='hive',
    format='parquet')
I get this exception:
KeyError: 'Column Region does not exist in schema'
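I suspect this is because write_dataset isn't looking at the correct schema when it resolves the partitioning columns. It should use scanner.projected_schema (which contains the renamed column Region) rather than scanner.dataset_schema (which only contains the original column region).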
I think it's just a matter of updating this line: https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967
The issue was raised here: https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning