Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0.0
Fix Version/s: None
Description
Writing a table to Parquet and then reading it back fails if both of the following hold:
- One of the columns is a dictionary (it came from a pandas Categorical), and
- The table's schema is passed to `read_table`.
The read fails while attempting to cast int64 to dictionary (full stack trace below).
This seems related to ARROW-11157. Even if the categorical type is lost when reading from Parquet, the reader should not fail when it is given the original schema.
Minimal example of failing code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]

df = pd.DataFrame({"a":a, "b":b, "c":c})
df['a'] = df['a'].astype('category')
print("df dtypes:\n", df.dtypes)

t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema

ds.write_dataset(t, format='parquet', base_dir='./test')
df2 = pq.read_table('./test', schema=s).to_pandas()
print("df2 dtypes:\n", df2.dtypes)
Which gives:
df dtypes:
 a    category
b      object
c       int64
dtype: object
Traceback (most recent call last):
  File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
    df2 = pq.read_table('./test', schema=s).to_pandas()
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary