Apache Arrow / ARROW-11157

[Python] Consistent handling of categoricals

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      What is the current state of categoricals in pyarrow? The `categories` parameter mentioned in this GitHub issue no longer seems to be accepted by `pd.read_parquet`. When reading with pyarrow, `int` categoricals do not survive a write/read round trip. `str` categoricals do, unless the file was written by fastparquet.

      Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:

       

      import os
      import pandas as pd

      fname = '/tmp/tst'

      # One categorical column with int categories, one with str categories.
      data = {
          'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
          'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
      }
      df = pd.DataFrame(data)

      # Round-trip the frame through every writer/reader engine combination.
      for write in ['fastparquet', 'pyarrow']:
          for read in ['fastparquet', 'pyarrow']:
              # Start from a clean file for each combination.
              if os.path.exists(fname):
                  os.remove(fname)
              df.to_parquet(fname, engine=write, compression=None)
              df_read = pd.read_parquet(fname, engine=read)

              # Report whether each column's dtype survived the round trip.
              print()
              print('write:', write, 'read:', read)
              for t in data:
                  print(t, df[t].dtype == df_read[t].dtype)

       

       

      write: fastparquet read: fastparquet
      int True
      str True
      write: fastparquet read: pyarrow
      int False
      str False
      write: pyarrow read: fastparquet
      int True
      str True
      write: pyarrow read: pyarrow
      int False
      str True
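
      Until the engines agree, one possible workaround is to re-apply the expected categorical dtypes after reading. The following is only a sketch and not something pyarrow or pandas provides: `restore_categoricals` is a hypothetical helper name, and `df_lossy` is a stand-in that simulates a read in which the categorical dtypes were dropped, so the example runs without either parquet engine installed. It assumes the original frame (or at least its dtypes) is still available on the reading side.

```python
import pandas as pd


def restore_categoricals(df_read, df_reference):
    """Return a copy of df_read with categorical dtypes copied from df_reference.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    out = df_read.copy()
    for col in df_reference.columns:
        ref_dtype = df_reference[col].dtype
        # Only touch columns that were categorical in the reference frame
        # and came back with a different dtype.
        if isinstance(ref_dtype, pd.CategoricalDtype) and col in out.columns:
            if out[col].dtype != ref_dtype:
                out[col] = out[col].astype(ref_dtype)
    return out


# Same frame as in the report above.
df = pd.DataFrame({
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Stand-in for a lossy parquet read (assumption): categoricals come back
# as plain int64 / object columns.
df_lossy = pd.DataFrame({
    'int': df['int'].astype('int64'),
    'str': df['str'].astype(object),
})

df_fixed = restore_categoricals(df_lossy, df)

# Same per-column dtype check as in the report's loop.
dtype_matches = {col: bool(df[col].dtype == df_fixed[col].dtype) for col in df.columns}
print(dtype_matches)
```

      In the report's loop, the same helper would be applied to `df_read` right after `pd.read_parquet`. This only hides the inconsistency on the reading side; it does not change what either engine writes to the file.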
      

            People

              Assignee: Unassigned
              Reporter: Chris Roat (chrisroat)