Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
From https://github.com/dask/dask/pull/7066#issuecomment-767156623
Simplified reproducer:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'part': ['A', 'B'] * 5, 'col': range(10)})
pq.write_to_dataset(table, "test_partitioned_parquet", partition_cols=["part"])

# with partitioning_kwargs = {} there is no error
partitioning_kwargs = {"max_partition_dictionary_size": -1}
dataset = ds.dataset(
    "test_partitioned_parquet/",
    format="parquet",
    partitioning=ds.HivePartitioning.discover(**partitioning_kwargs),
)
frag = list(dataset.get_fragments())[0]
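For reference, reading the original fragment before any pickling behaves as expected. The check below is a sketch and not part of the original report; it assumes schema refers to the full dataset schema so that the partition column is included:

# Illustrative check on the un-pickled fragment (sketch, not from the report)
schema = dataset.schema  # assumed: the full dataset schema, including 'part'
print(frag.partition_expression)                 # dictionary-typed value, e.g. part == "A"
print(frag.to_table(schema=schema).to_pandas())  # reads back without errors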
Querying this fragment works fine, but after serialization/deserialization with pickle, it gives errors (and with the original data example I actually got a segfault as well):
In [16]: import pickle

In [17]: frag2 = pickle.loads(pickle.dumps(frag))

In [19]: frag2.partition_expression
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: invalid continuation byte

In [20]: frag2.to_table(schema=schema, columns=columns)
Out[20]:
pyarrow.Table
col: int64
part: dictionary<values=string, indices=int32, ordered=0>

In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
...
~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
ArrowException: Unknown error: Wrapping ɻ� failed
It seems the issue is specific to partition expressions with a dictionary type.
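Continuing from the reproducer above, the problem can likely be narrowed down to the partition expression itself, which is what holds the dictionary-typed value. A sketch of such a check (not from the original report):

import pickle

# Round-trip only the partition expression instead of the whole fragment
expr = frag.partition_expression
expr2 = pickle.loads(pickle.dumps(expr))
print(expr)
print(expr2)  # with the bug present, this may show corrupted values or raise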
Also, when using an integer column as the partition column, you get wrong values (silently in this case):
In [42]: frag.partition_expression
Out[42]: <pyarrow.dataset.Expression (part == [1, 2][0]:dictionary<values=int32, indices=int32, ordered=0>)>

In [43]: frag2.partition_expression
Out[43]: <pyarrow.dataset.Expression (part == [170145232, 32754][0]:dictionary<values=int32, indices=int32, ordered=0>)>
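The setup for the integer case is not shown above; presumably it only differs from the reproducer in the partition values. A sketch of that variant (the path name and values here are assumptions):

import pickle
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Same reproducer as above, but with integer partition values
table = pa.table({'part': [1, 2] * 5, 'col': range(10)})
pq.write_to_dataset(table, "test_partitioned_parquet_int", partition_cols=["part"])

dataset_int = ds.dataset(
    "test_partitioned_parquet_int/",
    format="parquet",
    partitioning=ds.HivePartitioning.discover(max_partition_dictionary_size=-1),
)
frag_int = list(dataset_int.get_fragments())[0]
frag_int2 = pickle.loads(pickle.dumps(frag_int))

print(frag_int.partition_expression)   # expected dictionary values [1, 2]
print(frag_int2.partition_expression)  # with the bug, silently different values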
Now, it seems this is fixed in master. But since I don't remember whether it was fixed intentionally (bkietz?), it would be good to add some tests for it.
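A regression test could round-trip fragments with dictionary-typed partition expressions (both string and integer partition columns) and compare them against the originals, roughly along the lines below. This is only a sketch; the test name and the use of the pytest tmp_path fixture are my own choices, not taken from an existing test:

import pickle
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds


def test_fragment_partition_expression_pickle_roundtrip(tmp_path):
    for values in (['A', 'B'] * 5, [1, 2] * 5):
        # Write a small hive-partitioned dataset and discover it with
        # dictionary-encoded partition keys, as in the reproducer above
        path = str(tmp_path / ("dataset_" + type(values[0]).__name__))
        table = pa.table({'part': values, 'col': range(10)})
        pq.write_to_dataset(table, path, partition_cols=["part"])

        dataset = ds.dataset(
            path,
            format="parquet",
            partitioning=ds.HivePartitioning.discover(
                max_partition_dictionary_size=-1),
        )
        for frag in dataset.get_fragments():
            restored = pickle.loads(pickle.dumps(frag))
            # The dictionary-typed partition expression should survive the
            # pickle round trip unchanged
            assert restored.partition_expression.equals(frag.partition_expression)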