Status: Open
Resolution: Unresolved
A dataframe with a MultiIndex built in this way:
import pandas as pd

df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = df1.set_index("b", append=True)
print(df1)
print(df1.index.get_level_values("idx0"))
gives with Pandas 1.5.0:
          a
idx0 b
0    20  10
1    21  11
2    22  12
RangeIndex(start=0, stop=3, step=1, name='idx0')
while with Pandas 1.4.4:
          a
idx0 b
0    20  10
1    21  11
2    22  12
Int64Index([0, 1, 2], dtype='int64', name='idx0')
i.e. with pandas 1.5.0 the "idx0" level is returned as a RangeIndex instead of an Int64Index.
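To check which behaviour a given installation shows, the index class of each level can be printed directly. This is only a minimal sketch that rebuilds the frame from the example above; the `level_types` dictionary is an illustrative name, and no claim is made here about what a particular pandas version prints.

```python
import pandas as pd

# Rebuild the frame from the report above.
df1 = pd.DataFrame(
    {"a": [10, 11, 12], "b": [20, 21, 22]},
    index=pd.RangeIndex(3, name="idx0"),
)
df1 = df1.set_index("b", append=True)

print("pandas", pd.__version__)

# Record the concrete index class returned for each level of the MultiIndex.
# `level_types` is a name introduced for this sketch only.
level_types = {}
for name in df1.index.names:
    values = df1.index.get_level_values(name)
    level_types[name] = type(values).__name__
    print(name, level_types[name], list(values))
```

Running this under both pandas versions should make the difference described above visible without involving pyarrow at all.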
With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:
df1.to_parquet(path, engine="pyarrow", index=None)
then reading the same file with:
pd.read_parquet(path, engine="pyarrow")
raises an exception:
File /<venv>/lib/python3.9/site-packages/pyarrow/, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
    995 def _extract_index_level(table, result_table, field_name,
    996                          field_name_to_metadata):
--> 997     logical_name = field_name_to_metadata[field_name]['name']
    998     index_name = _backwards_compatible_index_name(field_name, logical_name)
    999     i = table.schema.get_field_index(field_name)

KeyError: 'b'
while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly.
Note that the problem disappears if the parquet file is written with index=True (which is not the default value), probably because the RangeIndex is then converted to an Int64Index:
df1.to_parquet(path, engine="pyarrow", index=True)
I suspect that the issue is caused by the change from Int64Index to RangeIndex and it may be related to
Should pyarrow be able to handle this case? Or is it an issue with Pandas?