Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version: 0.15.1
Fix Version: None

Environment:
Python 3.7.7
MacOS (Darwin-19.4.0-x86_64-i386-64bit)
Pandas 1.0.3
Pyarrow 0.15.1
Description
When the DataFrame's columns are a CategoricalIndex, writing the table and reading it back causes a TypeError: data type "categorical" not understood:
import pandas as pd
from pyarrow import parquet, Table

base_df = pd.DataFrame(
    [['foo', 'j', "1"], ['bar', 'j', "1"], ['foo', 'j', "1"], ['foobar', 'j', "1"]],
    columns=['my_cat', 'var', 'for_count'])
base_df['my_cat'] = base_df['my_cat'].astype('category')

df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)
print(df)
The resulting data frame looks something like this:
        my_cat_counts
my_cat            foo  bar  foobar
var
j                   2    1       1
Then, writing the table and reading it back causes the TypeError:
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()

> TypeError: data type "categorical" not understood
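For context (this check is not in the original report), the column index of df is a MultiIndex whose "my_cat" level carries the categorical dtype, which is presumably what trips up the conversion back to pandas:

# Not part of the original report: inspect the column index.
# The "my_cat" level has dtype 'category'.
print(type(df.columns))                             # pandas MultiIndex
print(df.columns.get_level_values("my_cat").dtype)  # category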
In the example, the column index is also a MultiIndex, but that isn't the problem; collapsing it to a single level still fails:
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()

> TypeError: data type "categorical" not understood
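Assuming the trigger really is the categorical dtype of the column index itself, a stripped-down reproducer (not taken from the original report; the frame and file name are made up for illustration) would look like this:

# Minimal sketch: a frame whose columns are a CategoricalIndex, with no
# groupby/unstack involved. Expected to raise the same TypeError on
# pyarrow 0.15.1 when read back.
import pandas as pd
from pyarrow import parquet, Table

small = pd.DataFrame([[1, 2]], columns=pd.CategoricalIndex(["a", "b"]))
parquet.write_table(Table.from_pandas(small), "small.pqt")
parquet.read_table("small.pqt").to_pandas()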
This is the workaround suggested on Stack Overflow:
df.columns = pd.Index(list(df.columns))  # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()  # no error
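Until the round trip works out of the box, one way to apply the workaround consistently is a small wrapper; this helper is hypothetical (not part of pyarrow or the report) and assumes a single-level column index:

import pandas as pd
from pyarrow import parquet, Table

def write_table_plain_columns(frame: pd.DataFrame, path: str) -> None:
    # Hypothetical helper: flatten a categorical (single-level) column index
    # to a plain object Index before handing the frame to pyarrow.
    out = frame.copy()
    out.columns = pd.Index(list(out.columns))
    parquet.write_table(Table.from_pandas(out), path)

write_table_plain_columns(df, "test.pqt")
parquet.read_table("test.pqt").to_pandas()  # no error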
Are there any plans to support the pattern described here in the future?