Apache Arrow / ARROW-5480

[Python] Pandas categorical type doesn't survive a round-trip through parquet

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.11.1, 0.13.0
    • Fix Version/s: None
    • Component/s: Python
    • Environment:
      python: 3.7.3.final.0
      python-bits: 64
      OS: Linux
      OS-release: 5.0.0-15-generic
      machine: x86_64
      processor: x86_64
      byteorder: little
      pandas: 0.24.2
      numpy: 1.16.4
      pyarrow: 0.13.0

      Description

      A string categorical variable written from pandas to parquet is read back as string (object dtype). I expected it to be read back as category.
      The same thing happens if the category is numeric – a numeric category is read back as int64.

      In the code below, I tried an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file and read it back, the category dtype is lost.

      In the scheme of things, this isn't a big deal, but it's a small surprise.

      import pandas as pd
      import pyarrow as pa
      
      
      df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
      df.dtypes  # category
      
      # This works:
      pa.Table.from_pandas(df).to_pandas().dtypes  # category
      
      df.to_parquet("categories.parquet")
      # This reads back object, but I expected category
      pd.read_parquet("categories.parquet").dtypes  # object
      
      
      # Numeric categories have the same issue:
      df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
      df_num.dtypes # category
      
      pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
      
      df_num.to_parquet("categories_num.parquet")
      # This reads back int64, but I expected category
      pd.read_parquet("categories_num.parquet").dtypes  # int64
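
A possible workaround, sketched here as an illustration rather than anything proposed in this issue: record the categorical dtypes before writing and re-apply them after reading. The helper names `category_dtypes` and `restore_categories` are hypothetical, and the lossy round trip is simulated with `astype(object)` so the sketch runs without a parquet engine; in real use `lossy` would be the result of `pd.read_parquet(...)`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def category_dtypes(df):
    """Collect the CategoricalDtype of every categorical column (hypothetical helper)."""
    return {
        name: dtype
        for name, dtype in df.dtypes.items()
        if isinstance(dtype, CategoricalDtype)
    }


def restore_categories(df, dtypes):
    """Cast the named columns back to their recorded categorical dtypes (hypothetical helper)."""
    return df.astype(dtypes)


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
saved = category_dtypes(df)

# Assumption: this stands in for the lossy parquet round trip described above.
lossy = df.astype({'x': object})

restored = restore_categories(lossy, saved)
print(restored.dtypes)
```

This only helps when the writer and the reader can share the saved dtypes (for example within one script); it does not change what is stored in the parquet file itself.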
      

              People

              • Assignee: Wes McKinney (wesmckinn)
              • Reporter: Karl Dunkle Werner (karldw)
              • Votes: 0
              • Watchers: 4

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1h 20m