[ARROW-11157] [Python] Consistent handling of categoricals - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: Python
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/27067

Description

What is the current state of categoricals in pyarrow? The `categories` parameter mentioned in this GitHub issue no longer seems to be accepted by `pd.read_parquet`. With pyarrow as the reader, `int` categoricals do not survive a write/read round trip. `str` categoricals do, unless the file was written by fastparquet.

Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:

import os
import pandas as pd

fname = '/tmp/tst'

# One integer-valued and one string-valued categorical column.
data = {
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
}
df = pd.DataFrame(data)

# Round-trip the frame through every writer/reader engine combination.
for write in ['fastparquet', 'pyarrow']:
    for read in ['fastparquet', 'pyarrow']:
        if os.path.exists(fname):
            os.remove(fname)
        df.to_parquet(fname, engine=write, compression=None)
        df_read = pd.read_parquet(fname, engine=read)

        # Report whether each column kept its categorical dtype.
        print()
        print('write:', write, 'read:', read)
        for t in data.keys():
            print(t, df[t].dtype == df_read[t].dtype)

write: fastparquet read: fastparquet
int True
str True

write: fastparquet read: pyarrow
int False
str False

write: pyarrow read: fastparquet
int True
str True

write: pyarrow read: pyarrow
int False
str True
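
The True/False output above shows that a dtype changed, but not what the column became. A minimal sketch of a more detailed report follows. `dtype_report` is a hypothetical helper, not part of pandas or pyarrow, and the `astype` call only simulates a lossy round trip so that the example runs without a Parquet engine installed.

```python
import pandas as pd


def dtype_report(original: pd.DataFrame, roundtripped: pd.DataFrame) -> pd.DataFrame:
    """Compare column dtypes before and after a round trip (hypothetical helper)."""
    rows = []
    for col in original.columns:
        before = original[col].dtype
        after = roundtripped[col].dtype
        rows.append({
            'column': col,
            # str() of a categorical dtype is just 'category'; the 'equal'
            # column still reflects differences in the categories themselves.
            'before': str(before),
            'after': str(after),
            'equal': before == after,
        })
    return pd.DataFrame(rows).set_index('column')


df = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})

# Assumption: stand-in for a lossy round trip in which the 'int' column
# comes back as plain int64. Replace with pd.read_parquet(...) in practice.
df_read = df.astype({'int': 'int64'})

report = dtype_report(df, df_read)
print(report)
```

In the reproduction script, `df_read` would come from `pd.read_parquet(fname, engine=read)` instead of the simulated cast.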

Issue Links

relates to: ARROW-13342 [Python] Categorical boolean column saved as regular boolean in parquet (Open)

People

Assignee: Unassigned
Reporter: Chris Roat
Votes: 0
Watchers: 6

Dates

Created: 07/Jan/21 01:24
Updated: 11/Jan/23 08:17