Apache Arrow / ARROW-17806

pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      A dataframe with a MultiIndex built in this way:

      import pandas as pd
      df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
      df1 = df1.set_index("b", append=True)
      print(df1)
      print(df1.index.get_level_values("idx0")) 

      gives the following output with pandas 1.5.0:

                a
      idx0 b     
      0    20  10
      1    21  11
      2    22  12
      
      RangeIndex(start=0, stop=3, step=1, name='idx0')

      while pandas 1.4.4 gives:

                a
      idx0 b     
      0    20  10
      1    21  11
      2    22  12
      
      Int64Index([0, 1, 2], dtype='int64', name='idx0')

      i.e. with pandas 1.5.0, get_level_values("idx0") returns a RangeIndex instead of an Int64Index.

      With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (the default), as in:

      df1.to_parquet(path, engine="pyarrow", index=None) 

      then reading the same file with:

      pd.read_parquet(path, engine="pyarrow") 

      raises an exception:

       File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
          995 def _extract_index_level(table, result_table, field_name,
          996                          field_name_to_metadata):
      --> 997     logical_name = field_name_to_metadata[field_name]['name']
          998     index_name = _backwards_compatible_index_name(field_name, logical_name)
          999     i = table.schema.get_field_index(field_name)
      
      KeyError: 'b'
      

      while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly. 

      Note that the problem disappears if the Parquet file is written with index=True (which is not the default), probably because the RangeIndex is then converted to an Int64Index:

      df1.to_parquet(path, engine="pyarrow", index=True)  

      I suspect that the issue is caused by the change from Int64Index to RangeIndex and it may be related to https://github.com/pandas-dev/pandas/issues/46675

      Should pyarrow be able to handle this case? Or is it an issue with Pandas?

          People

            Assignee: Unassigned
            Reporter: Gianluca Ficarelli (gianluca313)