  1. Apache Arrow
  2. ARROW-17852

[python] `dtype` of `Categorical` category columns is not preserved


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Python

    Description

      Hi there,

      First time submitting an issue here so apologies if there's anything I've missed.

      I see the bug below, whereby the dtype of the categories themselves (within a pd.Categorical) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.

      This causes a problem because the category dtypes need to be the same for the categoricals to be considered the same (so that they can then be concatenated, for example).
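To illustrate the concatenation point, here is a minimal sketch that uses pandas only. The variable names (`left`, `right`, `combined`, `dtypes_match`) are made up for this example, and the exact dtype of the concatenated result may vary between pandas versions, so no output is claimed beyond the result no longer being categorical.

```python
import pandas as pd

# Same values, but the categories are backed by different dtypes.
left = pd.Series(["a", "b"], dtype=pd.StringDtype()).astype("category")
right = pd.Series(["a", "b"], dtype=object).astype("category")

# The category dtype follows the dtype of the source column.
print(left.cat.categories.dtype)
print(right.cat.categories.dtype)

# The two categorical dtypes are expected not to compare equal ...
dtypes_match = left.dtype == right.dtype
print(dtypes_match)

# ... so concatenation falls back to a non-categorical dtype.
combined = pd.concat([left, right], ignore_index=True)
print(combined.dtype)
```

In other words, a frame that went through pyarrow and one that did not can hold the same labels yet fail to concatenate as categoricals.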

      The current workaround is to store the column as a plain pd.StringDtype() and then convert it to pd.Categorical in memory with pandas (which infers the category dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).

      Using pyarrow 9.0.0 and pandas 1.4.4.

      Thanks

      import pandas as pd
      import pyarrow as pa

      # note, Categorical column B is constructed from `pd.StringDtype`
      df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())
      df["B"] = df["A"].astype("category")
      print(df["B"].cat.categories)
      # Index(['a', 'b', 'c'], dtype='string')

      # however, this is downcast to `object` during a roundtrip
      print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)
      # Index(['a', 'b', 'c'], dtype='object')

       


          People

            Assignee: Unassigned
            Reporter: Ryan Ballard (bollard)
            Votes: 0
            Watchers: 3
