Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 4.0.1
- Fix Version/s: None
- Component/s: None
Description
When saving a pandas DataFrame to Parquet, a categorical column whose categories are boolean is saved as a regular boolean column.
This is an issue because, when reading the Parquet file back, I expect the column to still be categorical.
Reproducible example:
import pandas as pd
import pyarrow
import pyarrow.parquet  # the parquet submodule must be imported explicitly

# Create dataframe with boolean column that is then converted to categorical
df = pd.DataFrame({'a': [True, True, False, True, False]})
df['a'] = df['a'].astype('category')

# Convert to arrow Table and save to disk
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, 'test.parquet')

# Reload data and convert back to pandas
table_rel = pyarrow.parquet.read_table('test.parquet')
df_rel = table_rel.to_pandas()
The Arrow Table correctly represents the column as an Arrow DICTIONARY type:
>>> df['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: category
Categories (2, object): [False, True]
>>>
>>> table
pyarrow.Table
a: dictionary<values=bool, indices=int8, ordered=0>
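This can also be checked programmatically on the Arrow side (a small sketch continuing the script above; is_dictionary comes from pyarrow.types):

import pyarrow

# 'table' is the Table produced by pyarrow.Table.from_pandas(df) above
field_type = table.schema.field('a').type
print(field_type)                                # dictionary<values=bool, indices=int8, ordered=0>
print(pyarrow.types.is_dictionary(field_type))   # True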
However, the reloaded column is now a regular boolean:
>>> table_rel
pyarrow.Table
a: bool
>>>
>>> df_rel['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: bool
I would have expected the column to be read back as categorical.
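A possible workaround until this is fixed (a sketch on the pandas side, not an official recommendation): since the values themselves round-trip correctly, the categorical dtype can be re-applied after reading the file back.

import pandas as pd
import pyarrow.parquet

# Re-read the file written by the reproducible example and re-apply the dtype manually
df_rel = pyarrow.parquet.read_table('test.parquet').to_pandas()
df_rel['a'] = df_rel['a'].astype('category')
print(df_rel['a'].dtype)  # category

Note that this only restores the categories actually observed in the data and not any original category order, so it is a stopgap rather than a true round trip.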
Issue Links
- is blocked by
  - ARROW-6140 [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY (Open; see the contrast sketch below)
- is related to
  - ARROW-11157 [Python] Consistent handling of categoricals (Open)
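For contrast, and to illustrate why ARROW-6140 blocks this issue: my understanding (hedged, based on direct dictionary decoding currently covering only BYTE_ARRAY columns) is that the same round trip with string categories does preserve the categorical dtype.

import pandas as pd
import pyarrow
import pyarrow.parquet

# Same pattern as the bug report, but with string categories (BYTE_ARRAY in Parquet)
df_str = pd.DataFrame({'b': ['x', 'x', 'y', 'x']})
df_str['b'] = df_str['b'].astype('category')
table_str = pyarrow.Table.from_pandas(df_str)
pyarrow.parquet.write_table(table_str, 'test_str.parquet')
df_str_rel = pyarrow.parquet.read_table('test_str.parquet').to_pandas()
print(df_str_rel['b'].dtype)  # category (expected), whereas the bool case above comes back as plain bool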