Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.17.1
Description
Apparently, ARROW-9288 did not fully fix the issue. With a single string partition field, discovery now works fine. But as soon as there are multiple string partition fields, you get parsing errors.
A reproducible example:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['d', 'e', 'f'], dtype=object)
N = 30

table = pa.table({
    'foo': foo_keys.repeat(10),
    'bar': np.tile(np.tile(bar_keys, 5), 2),
    'values': np.random.randn(N)
})

base_path = "test_partition_directories3"
pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

# works
ds.dataset(base_path, partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset(base_path, partitioning=part)
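For context, with two partition columns the dataset is laid out with one `key=value` directory level per column (here `bar=.../foo=...`). Below is a minimal pure-Python sketch of how such a path splits into partition fields. It is only an illustration and not Arrow's actual discovery code. The helper name `parse_hive_path` and the file name `part-0.parquet` are hypothetical.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Split a hive-style file path into an ordered {field: value} dict.

    The last path component is assumed to be the data file and is skipped.
    Values are kept as strings; no type inference is attempted.
    """
    fields = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep:  # only "key=value" directory segments carry partition info
            fields[key] = value
    return fields


# Hypothetical file name under the layout from the example above.
print(parse_hive_path("test_partition_directories3/bar=d/foo=a/part-0.parquet"))
# → {'bar': 'd', 'foo': 'a'}
```

Both fields here hold string values, which is the case that triggers the parsing errors when more than one such field is present.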
cc bkietz