Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 4.0.1
- Fix Version/s: None
- Component/s: None
Description
When saving a pandas DataFrame to Parquet, a categorical column whose categories are boolean is saved as a regular boolean column.
This is an issue because, when reading the Parquet file back, I expect the column to still be categorical.
Reproducible example:
import pandas as pd
import pyarrow
import pyarrow.parquet  # the parquet submodule must be imported explicitly

# Create dataframe with boolean column that is then converted to categorical
df = pd.DataFrame({'a': [True, True, False, True, False]})
df['a'] = df['a'].astype('category')

# Convert to arrow Table and save to disk
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, 'test.parquet')

# Reload data and convert back to pandas
table_rel = pyarrow.parquet.read_table('test.parquet')
df_rel = table_rel.to_pandas()
The Arrow Table correctly represents the column as an Arrow DICTIONARY type:
>>> df['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: category
Categories (2, object): [False, True]
>>>
>>> table
pyarrow.Table
a: dictionary<values=bool, indices=int8, ordered=0>
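This can also be checked programmatically on the Arrow side (a small sketch continuing the script above; is_dictionary comes from pyarrow.types):

import pyarrow

# 'table' is the Table produced by pyarrow.Table.from_pandas(df) above
field_type = table.schema.field('a').type
print(field_type)                                # dictionary<values=bool, indices=int8, ordered=0>
print(pyarrow.types.is_dictionary(field_type))   # True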
However, the reloaded column is now a regular boolean:
>>> table_rel
pyarrow.Table
a: bool
>>>
>>> df_rel['a']
0     True
1     True
2    False
3     True
4    False
Name: a, dtype: bool
I would have expected the column to be read back as categorical.
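A possible workaround until this is fixed (a sketch on the pandas side, not an official recommendation): since the values themselves round-trip correctly, the categorical dtype can be re-applied after reading the file back.

import pandas as pd
import pyarrow.parquet

# Re-read the file written by the reproducible example and re-apply the dtype manually
df_rel = pyarrow.parquet.read_table('test.parquet').to_pandas()
df_rel['a'] = df_rel['a'].astype('category')
print(df_rel['a'].dtype)  # category

Note that this only restores the categories actually observed in the data and not any original category order, so it is a stopgap rather than a true round trip.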
Issue Links
- is blocked by
  - ARROW-6140 [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY (Open; see the contrast sketch below)
- is related to
  - ARROW-11157 [Python] Consistent handling of categoricals (Open)
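For contrast, and to illustrate why ARROW-6140 blocks this issue: my understanding (hedged, based on direct dictionary decoding currently covering only BYTE_ARRAY columns) is that the same round trip with string categories does preserve the categorical dtype.

import pandas as pd
import pyarrow
import pyarrow.parquet

# Same pattern as the bug report, but with string categories (BYTE_ARRAY in Parquet)
df_str = pd.DataFrame({'b': ['x', 'x', 'y', 'x']})
df_str['b'] = df_str['b'].astype('category')
table_str = pyarrow.Table.from_pandas(df_str)
pyarrow.parquet.write_table(table_str, 'test_str.parquet')
df_str_rel = pyarrow.parquet.read_table('test_str.parquet').to_pandas()
print(df_str_rel['b'].dtype)  # category (expected), whereas the bool case above comes back as plain bool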