Details
- Improvement
- Status: Open
- Minor
- Resolution: Unresolved
- 2.0.0
- None
- None
Description
What is the current state of categoricals with pyarrow? The `categories` parameter mentioned in this GitHub issue no longer seems to be accepted by `pd.read_parquet`. When reading with pyarrow, `int` categoricals do not survive a write/read round trip, whereas `str` categoricals do, unless the file was written by fastparquet.
Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:

    import os
    import pandas as pd

    fname = '/tmp/tst'
    data = {
        'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
        'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
    }
    df = pd.DataFrame(data)

    # Try every combination of write engine and read engine
    for write in ['fastparquet', 'pyarrow']:
        for read in ['fastparquet', 'pyarrow']:
            if os.path.exists(fname):
                os.remove(fname)
            df.to_parquet(fname, engine=write, compression=None)
            df_read = pd.read_parquet(fname, engine=read)
            print()
            print('write:', write, 'read:', read)
            # True means the categorical dtype survived the round trip
            for t in data.keys():
                print(t, df[t].dtype == df_read[t].dtype)


    write: fastparquet read: fastparquet
    int True
    str True

    write: fastparquet read: pyarrow
    int False
    str False

    write: pyarrow read: fastparquet
    int True
    str True

    write: pyarrow read: pyarrow
    int False
    str True

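Until the round trip preserves the dtype, one possible workaround is to cast the affected columns back after reading. The sketch below is pandas-only and assumes the original frame (or its dtypes) is still available. `restore_categoricals` is a hypothetical helper name, not part of pandas or pyarrow, and the lost dtype is simulated with `astype` rather than an actual Parquet read.

```python
import pandas as pd


def restore_categoricals(df_read, reference):
    """Cast columns of df_read back to the categorical dtypes found in reference.

    Hypothetical helper for illustration; not a pandas or pyarrow API.
    """
    cat_dtypes = {
        name: dtype
        for name, dtype in reference.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and name in df_read.columns
    }
    return df_read.astype(cat_dtypes)


df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Simulate a read that dropped the categorical dtypes (assumption: the
# values themselves come back intact, only the dtype is lost).
df_plain = df.astype({'int': 'int64', 'str': object})

df_fixed = restore_categoricals(df_plain, df)
for col in df.columns:
    print(col, df[col].dtype == df_fixed[col].dtype)
```

This only papers over the problem on the reading side: the category list has to be known out of band, which is exactly the information the file is expected to carry.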
Issue Links
- relates to ARROW-13342 [Python] Categorical boolean column saved as regular boolean in parquet (Open)