  1. Apache Arrow
  2. ARROW-8812

[Python] Column names of type CategoricalIndex fail to convert back to pandas

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.15.1
    • Fix Version/s: None
    • Component/s: Python
    • Environment: Python 3.7.7
      MacOS (Darwin-19.4.0-x86_64-i386-64bit)
      Pandas 1.0.3
      Pyarrow 0.15.1

    Description

      When a DataFrame's columns are of type CategoricalIndex, writing the table to Parquet and reading it back raises TypeError: data type "categorical" not understood:

      import pandas as pd
      from pyarrow import parquet, Table
      
      base_df = pd.DataFrame([['foo', 'j', "1"],
                              ['bar', 'j', "1"],
                              ['foo', 'j', "1"],
                              ['foobar', 'j', "1"]],
                             columns=['my_cat', 'var', 'for_count'])
      
      base_df['my_cat'] = base_df['my_cat'].astype('category')
      
      df = (
          base_df
          .groupby(["my_cat", "var"], observed=True)
          .agg({"for_count": "count"})
          .rename(columns={"for_count": "my_cat_counts"})
          .unstack(level="my_cat", fill_value=0)
      )
      
      print(df)
      

      The resulting data frame looks something like this:

             my_cat_counts
      my_cat           foo bar foobar
      var
      j                  2   1      1

      Then, writing the table and reading it back raises the TypeError:

      parquet.write_table(Table.from_pandas(df), "test.pqt")
      parquet.read_table("test.pqt").to_pandas()
      > TypeError: data type "categorical" not understood
      

      In the example, the columns also form a MultiIndex, but that isn't the problem. The error persists after reducing the columns to the single categorical level:

      df.columns = df.columns.get_level_values(1)
      parquet.write_table(Table.from_pandas(df), "test.pqt")
      parquet.read_table("test.pqt").to_pandas()
      > TypeError: data type "categorical" not understood
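
      To check which part of a column index is categorical before writing, a small pandas-only helper along these lines may be useful. This is a sketch: `categorical_levels` is a hypothetical name, not part of pandas or pyarrow, and the sample indexes only stand in for the columns from the example above.

```python
import pandas as pd


def categorical_levels(columns: pd.Index) -> list:
    """Return the positions of the levels in `columns` that are a CategoricalIndex.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    # A flat index is treated as a single level at position 0.
    levels = columns.levels if isinstance(columns, pd.MultiIndex) else [columns]
    return [i for i, level in enumerate(levels)
            if isinstance(level, pd.CategoricalIndex)]


# Stand-ins for the column indexes discussed in the report.
flat = pd.CategoricalIndex(["foo", "bar", "foobar"], name="my_cat")
multi = pd.MultiIndex.from_product([["my_cat_counts"], flat])
plain = pd.Index(["foo", "bar", "foobar"])

print(categorical_levels(flat))
print(categorical_levels(multi))
print(categorical_levels(plain))
```

      A non-empty result means at least one level of the column index is categorical, whether or not the columns are a MultiIndex.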
      

      This is the workaround suggested on Stack Overflow:

      df.columns = pd.Index(list(df.columns)) # suggested fix for the time being
      parquet.write_table(Table.from_pandas(df), "test.pqt")
      parquet.read_table("test.pqt").to_pandas() # no error
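
      The same idea can be extended to keep the column index names and to cover the MultiIndex case from the first example. The following is only a sketch under assumptions: `plain_column_labels` is a hypothetical helper name, and the small frame is a hand-built stand-in for the unstacked result above, not the output of the original groupby.

```python
import pandas as pd


def plain_column_labels(columns: pd.Index) -> pd.Index:
    """Return an equivalent column index without categorical levels.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    if isinstance(columns, pd.MultiIndex):
        # Rebuilding from tuples lets pandas infer fresh, non-categorical levels.
        return pd.MultiIndex.from_tuples(list(columns), names=columns.names)
    # Same approach as the workaround above, but keeping the index name.
    return pd.Index(list(columns), name=columns.name)


# Hand-built stand-in for the unstacked frame from the report.
cat_level = pd.CategoricalIndex(["foo", "bar", "foobar"], name="my_cat")
columns = pd.MultiIndex.from_product([["my_cat_counts"], cat_level])
df = pd.DataFrame([[2, 1, 1]],
                  index=pd.Index(["j"], name="var"),
                  columns=columns)

df.columns = plain_column_labels(df.columns)
print(df.columns)
```

      After this step the frame can be passed to Table.from_pandas as in the workaround. The categorical dtype of the column labels is not restored on read, so it would have to be reapplied by hand if it matters.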
      

      Are there any plans to support the pattern described here in the future?

          People

            Assignee: Unassigned
            Reporter: Jonas Nelle (jonas-nelle)