[ARROW-11400] [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0
Component/s: Python
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27291

Description

From https://github.com/dask/dask/pull/7066#issuecomment-767156623

Simplified reproducer:

import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
pq.write_to_dataset(table, "test_partitioned_parquet", partition_cols=["part"])

# with partitioning_kwargs = {} there is no error
partitioning_kwargs = {"max_partition_dictionary_size": -1}
dataset = ds.dataset(
    "test_partitioned_parquet/", format="parquet", 
    partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
)

frag = list(dataset.get_fragments())[0]

Querying this fragment works fine, but after serialization/deserialization with pickle, it gives errors (and with the original data example I actually got a segfault as well):

In [16]: import pickle

In [17]: frag2 = pickle.loads(pickle.dumps(frag))

In [19]: frag2.partition_expression
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: invalid continuation byte

In [20]: frag2.to_table(schema=schema, columns=columns)
Out[20]: 
pyarrow.Table
col: int64
part: dictionary<values=string, indices=int32, ordered=0>

In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
...
~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

ArrowException: Unknown error: Wrapping ɻ� failed

It seems the issue was specifically with a partition expression with dictionary type.
Also when using an integer columns as the partition column, you get wrong values (but silently in this case):

In [42]: frag.partition_expression
Out[42]: 
<pyarrow.dataset.Expression (part == [
  1,
  2
][0]:dictionary<values=int32, indices=int32, ordered=0>)>

In [43]: frag2.partition_expression
Out[43]: 
<pyarrow.dataset.Expression (part == [
  170145232,
  32754
][0]:dictionary<values=int32, indices=int32, ordered=0>)>

Now, it seems this is fixed in master. But since I don't remember it was fixed intentionally (bkietz?), it would be good to add some tests for it.

Attachments

Issue Links

links to

GitHub Pull Request #9350

Activity

People

Assignee:: Joris Van den Bossche

Reporter:: Joris Van den Bossche

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 27/Jan/21 14:52

Updated:: 11/Jan/23 08:19

Resolved:: 02/Feb/21 14:09

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m