Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 9.0.0
Description
A dataframe with a MultiIndex built in this way:
import pandas as pd

df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = df1.set_index("b", append=True)
print(df1)
print(df1.index.get_level_values("idx0"))
gives with Pandas 1.5.0:
          a
idx0 b
0    20  10
1    21  11
2    22  12
RangeIndex(start=0, stop=3, step=1, name='idx0')
while with Pandas 1.4.4:
          a
idx0 b
0    20  10
1    21  11
2    22  12
Int64Index([0, 1, 2], dtype='int64', name='idx0')
i.e. the result is RangeIndex instead of Int64Index.
With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:
df1.to_parquet(path, engine="pyarrow", index=None)
then reading the same file with:
pd.read_parquet(path, engine="pyarrow")
raises an exception:
File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
    995 def _extract_index_level(table, result_table, field_name,
    996                          field_name_to_metadata):
--> 997     logical_name = field_name_to_metadata[field_name]['name']
    998     index_name = _backwards_compatible_index_name(field_name, logical_name)
    999     i = table.schema.get_field_index(field_name)

KeyError: 'b'
while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly.
Note that the problem disappears if the Parquet file is written with index=True (which is not the default), probably because the RangeIndex is then converted to an Int64Index:
df1.to_parquet(path, engine="pyarrow", index=True)
I suspect that the issue is caused by the change from Int64Index to RangeIndex and it may be related to https://github.com/pandas-dev/pandas/issues/46675
Should pyarrow be able to handle this case? Or is it an issue with Pandas?