Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Add a note to the docs: if both a partitioning and a schema are specified when opening a dataset, and the partitioning field names are not present in the data files, then the schema must also include the partitioning field names (for directory or hive partitioning) whenever filtering on those fields will be done.
Example:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to partitioned dataset
# The files will include columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['one'])

# Reading the partitioned dataset with a schema not including the
# partitioning names will error
schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# And will not if done like so:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)