[ARROW-17852] [python] `dtype` of `Categorical` category columns are not preserved - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0.0
Fix Version/s: None
Component/s: Python
Labels:

External issue URL:
https://github.com/apache/arrow/issues/33070

Description

Hi there,

First time submitting an issue here so apologies if there's anything I've missed.

I'm seeing a bug whereby the dtype of the categories themselves (within a pd.Categorical) is not preserved on a round trip through pyarrow. Hopefully the snippet below demonstrates the issue.

This causes a problem because the category dtypes need to match for two categoricals to be considered the same dtype (so that they can then be concatenated, for example).
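To illustrate why the mismatch matters, here is a minimal sketch that does not involve pyarrow at all. The variable names are mine, and the two categoricals are built by hand to stand in for the "before" and "after" of the round trip; the exact error message may differ between pandas versions.

```python
import pandas as pd
from pandas.api.types import union_categoricals

labels = ["a", "b", "c"]
codes = [0, 1, 2, 0]

# Same labels and codes, but the categories are backed by different dtypes.
string_backed = pd.Categorical.from_codes(
    codes, categories=pd.Index(labels, dtype="string")
)
object_backed = pd.Categorical.from_codes(
    codes, categories=pd.Index(labels, dtype=object)
)

# The two CategoricalDtype objects do not compare equal.
dtypes_match = string_backed.dtype == object_backed.dtype
print(dtypes_match)

# union_categoricals refuses to combine categoricals whose category dtypes differ.
try:
    union_categoricals([string_backed, object_backed])
    union_failed = False
except TypeError as exc:
    union_failed = True
    print(exc)
```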

The current workaround is to store the column as a plain pd.StringDtype() and convert it to pd.Categorical in memory with pandas after reading (pandas infers the category dtype from the underlying type), but this sacrifices the disk savings of storing the column as a dictionary.

Using pyarrow 9.0.0 and pandas 1.4.4.

Thanks

import pandas as pd
import pyarrow as pa

# note, Categorical column B is constructed from `pd.StringDtype`
df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())
df["B"] = df["A"].astype("category")

print(df["B"].cat.categories)
# Index(['a', 'b', 'c'], dtype='string')

# however, this is downcast to `object` during a roundtrip
print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)
# Index(['a', 'b', 'c'], dtype='object')

People

Assignee: Unassigned
Reporter: Ryan Ballard
Votes: 0
Watchers: 3

Dates

Created: 27/Sep/22 09:47
Updated: 11/Jan/23 11:56