[ARROW-3652] [Python] CategoricalIndex is lost after reading back - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.11.1
Fix Version/s: 0.15.0
Component/s: Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/19959

Description

When a DataFrame with a CategoricalIndex is written to Parquet and read back, the resulting index is no longer categorical.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
df['c1'] = df['c1'].astype('category')
df = df.set_index(['c1'])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

ref_df = pq.read_pandas('test.parquet').to_pandas()

print(df.index)
# CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, name='c1', dtype='category')

print(ref_df.index)
# Index(['a', 'c'], dtype='object', name='c1')
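On affected versions, one possible workaround is to re-cast the index after reading. The following is only a minimal sketch, not the fix that was merged for this issue: `ref_df` here is a hand-built stand-in for the frame read back from Parquet, and `restored` is a hypothetical name. Categories are inferred from the values present, so unused categories or a custom category order are not recovered this way.

```python
import pandas as pd

# Stand-in for the frame read back from Parquet: the index arrives as plain object dtype.
ref_df = pd.DataFrame(
    {"c2": ["b", "d"]},
    index=pd.Index(["a", "c"], dtype="object", name="c1"),
)

# Re-cast the index to categorical; astype keeps the index name.
restored = ref_df.copy()
restored.index = restored.index.astype("category")

print(type(restored.index).__name__)
print(list(restored.index.categories))
```

The same cast can be applied to the result of `pq.read_pandas(...).to_pandas()` in the reproduction above.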

The pandas metadata stored with the table correctly records the column as categorical:

{"name": "c1", "field_name": "c1", "pandas_type": "categorical", "numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}
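To make the point about the metadata concrete, the column entry can be parsed and inspected with the standard library alone. This is a sketch under assumptions: the JSON string is the excerpt from this issue with the byte-string fragments joined and the outer brace closed, and `is_categorical` is a hypothetical helper, not a pyarrow function. In pyarrow the full document is typically found under the `b"pandas"` key of `table.schema.metadata`.

```python
import json

# Column entry taken from the issue's metadata excerpt (fragments joined, outer brace closed).
column_entry = json.loads(
    '{"name": "c1", "field_name": "c1", "pandas_type": "categorical", '
    '"numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}'
)


def is_categorical(entry):
    """Return True if a pandas-metadata column entry declares a categorical column."""
    return entry.get("pandas_type") == "categorical"


if is_categorical(column_entry):
    details = column_entry["metadata"]
    print(column_entry["name"], details["num_categories"], details["ordered"])
```

The entry records that `c1` is categorical with two unordered categories, so the information needed to rebuild the CategoricalIndex is present even though the reader did not use it.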

Issue Links

relates to

ARROW-3325 [Python] Support reading Parquet binary/string columns directly as DictionaryArray

Resolved

ARROW-3772 [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

Resolved

ARROW-3246 [Python][Parquet] direct reading/writing of pandas categoricals in parquet

Resolved

ARROW-5480 [Python] Pandas categorical type doesn't survive a round-trip through parquet

Resolved

links to

GitHub Pull Request #5117

People

Assignee: Wes McKinney

Reporter: Armin Berres

Votes: 0

Watchers: 5

Dates

Created: 30/Oct/18 10:19

Updated: 11/Jan/23 07:28

Resolved: 19/Aug/19 19:18

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h