[ARROW-17806] pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0 - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0.0
Fix Version/s: None
Component/s: Parquet, Python
Labels: None
External issue URL: https://github.com/apache/arrow/issues/33030

Description

A dataframe with a MultiIndex built in this way:

import pandas as pd
df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = df1.set_index("b", append=True)
print(df1)
print(df1.index.get_level_values("idx0"))

prints the following with pandas 1.5.0:

          a
idx0 b     
0    20  10
1    21  11
2    22  12

RangeIndex(start=0, stop=3, step=1, name='idx0')

while with pandas 1.4.4 it prints:

          a
idx0 b     
0    20  10
1    21  11
2    22  12

Int64Index([0, 1, 2], dtype='int64', name='idx0')

That is, with pandas 1.5.0 get_level_values("idx0") returns a RangeIndex, whereas with pandas 1.4.4 it returns an Int64Index.
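A quick way to check which behaviour a given pandas installation shows is to list the MultiIndex levels that come back as a RangeIndex. This is only a minimal sketch: the variable name range_levels is made up for illustration, and the result depends on the installed pandas version (per the outputs above, "idx0" should be listed with 1.5.0 and nothing with 1.4.4).

```python
import pandas as pd

# Same frame as in the report: a named RangeIndex with column "b" appended as a second level.
df1 = pd.DataFrame(
    {"a": [10, 11, 12], "b": [20, 21, 22]},
    index=pd.RangeIndex(3, name="idx0"),
).set_index("b", append=True)

# Collect the names of the index levels that pandas returns as a RangeIndex.
# range_levels is a hypothetical name used only in this sketch.
range_levels = [
    name
    for name in df1.index.names
    if isinstance(df1.index.get_level_values(name), pd.RangeIndex)
]
print(range_levels)
```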

With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:

df1.to_parquet(path, engine="pyarrow", index=None)

then reading the same file with:

pd.read_parquet(path, engine="pyarrow")

raises an exception:

 File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
    995 def _extract_index_level(table, result_table, field_name,
    996                          field_name_to_metadata):
--> 997     logical_name = field_name_to_metadata[field_name]['name']
    998     index_name = _backwards_compatible_index_name(field_name, logical_name)
    999     i = table.schema.get_field_index(field_name)

KeyError: 'b'

while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly.

Note that the problem disappears if the parquet file is written with index=True (which is not the default), probably because the RangeIndex is then converted to an Int64Index:

df1.to_parquet(path, engine="pyarrow", index=True)

I suspect that the issue is caused by the change from Int64Index to RangeIndex and it may be related to https://github.com/pandas-dev/pandas/issues/46675

Should pyarrow be able to handle this case? Or is it an issue with Pandas?

People

Assignee: Unassigned

Reporter: Gianluca Ficarelli

Votes: 0

Watchers: 2

Dates

Created: 21/Sep/22 15:29

Updated: 11/Jan/23 11:55