Apache Arrow / ARROW-5480

[Python] Pandas categorical type doesn't survive a round-trip through parquet

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.11.1, 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: Python
    • Environment:
      python: 3.7.3.final.0
      python-bits: 64
      OS: Linux
      OS-release: 5.0.0-15-generic
      machine: x86_64
      processor: x86_64
      byteorder: little
      pandas: 0.24.2
      numpy: 1.16.4
      pyarrow: 0.13.0

    Description

      A string categorical variable written from pandas to parquet is read back as string (object dtype). I expected it to be read back as category.
      The same thing happens if the category is numeric – a numeric category is read back as int64.

      In the code below, I tried an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the category dtype is lost.

      In the scheme of things, this isn't a big deal, but it's a small surprise.

      import pandas as pd
      import pyarrow as pa
      
      
      df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
      df.dtypes  # category
      
      # This works:
      pa.Table.from_pandas(df).to_pandas().dtypes  # category
      
      df.to_parquet("categories.parquet")
      # This reads back object, but I expected category
      pd.read_parquet("categories.parquet").dtypes  # object
      
      
      # Numeric categories have the same issue:
      df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
      df_num.dtypes # category
      
      pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
      
      df_num.to_parquet("categories_num.parquet")
      # This reads back int64, but I expected category
      pd.read_parquet("categories_num.parquet").dtypes  # int64
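
      Until the dtype survives the round trip, one possible workaround is to record the categorical dtypes before writing and re-apply them after reading. The sketch below is only an illustration: the helper names categorical_dtypes and restore_categories are hypothetical, and the lossy parquet round trip is simulated with astype(object) so that the example runs without a parquet engine. In practice that line would be the to_parquet / read_parquet pair from the report.

```python
import pandas as pd


def categorical_dtypes(df):
    """Return {column name: CategoricalDtype} for every categorical column."""
    return {
        name: dtype
        for name, dtype in df.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype)
    }


def restore_categories(df, dtypes):
    """Cast the listed columns back to their recorded categorical dtypes."""
    return df.astype(dtypes)


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
saved = categorical_dtypes(df)

# Stand-in for the lossy round trip (assumption: this replaces
# df.to_parquet(...) followed by pd.read_parquet(...)).
lossy = df.astype({'x': object})

# Re-apply the recorded dtypes; the column is categorical again.
restored = restore_categories(lossy, saved)
```

      Recording the full CategoricalDtype, rather than casting with the plain string 'category', keeps the original category list and ordering flag even if some categories do not occur in the data that was read back.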
      

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Karl Dunkle Werner (karldw)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h