[ARROW-12631] [Python] pyarrow.dataset.write_table should accept a Scanner to write - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.0.0
Component/s: Python
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/28383

Description

Suppose you open a dataset and want to write it back with some projected columns. Currently you have to either materialize it as a Table or convert it to an iterator of batches before you can write the dataset:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(pa.table({'a': [1, 2, 3]}))

# write with projected columns
projection = {'b': ds.field('a')}

# this works but materializes full table
ds.write_dataset(dataset.to_table(columns=projection), "test.parquet", format="parquet")

# this requires the exact schema, which is a bit annoying as you need to construct that manually
ds.write_dataset(dataset.to_batches(columns=projection), "test.parquet", format="parquet", schema=...<projected schema>...)

Could we expect the following to work instead?

ds.write_dataset(dataset.scanner(columns=projection), "test.parquet", format="parquet")

cc lidavidm do you think this logic is correct?

(encountered this while trying to reproduce ~~ARROW-12620~~ in Python)

Issue Links

is related to

ARROW-12647 [Python][Dataset] Consider allowing projecting/scanning with a given schema

Open

links to

GitHub Pull Request #10224

People

Assignee: Joris Van den Bossche

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 03/May/21 12:54

Updated: 11/Jan/23 08:27

Resolved: 04/May/21 12:44

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 1h 40m