[ARROW-17228] [Python] dataset.write_dataset should use Scanner.projected_schema when passed a scanner with projected columns

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 8.0.0
Fix Version/s: 10.0.0
Component/s: Python
Labels:
- pull-request-available
Environment:
Python 3.9.13
pyarrow 8.0.0

External issue URL:
https://github.com/apache/arrow/issues/20344
Language:
- python

Description

In the code below:

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)

ds.write_dataset(
    scanner,
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')

I get this exception:

KeyError: 'Column Region does not exist in schema'

I suspect this happens because write_dataset resolves the partitioning columns against the wrong schema. It should use scanner.projected_schema (rather than scanner.dataset_schema), since the renamed column "Region" exists only in the projection.

I think it's just a matter of updating this line: https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967

The issue was raised here: https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning

Issue Links

Links to: GitHub Pull Request #13756

Dates

Created: 27/Jul/22 17:00

Updated: 11/Jan/23 11:49

Resolved: 02/Aug/22 12:51
