Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Add a note to the docs: if both a partitioning and a schema are specified when opening a dataset, and the partitioning field names are not present in the data files, then the schema must also include the partitioning field names (for directory or hive partitioning) whenever filtering on those fields will be done.
Example:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to partitioned dataset
# The files will include columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['one'])

# Reading the partitioned dataset with a schema not including the
# partitioning names will error
schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# And will not if done like so:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)