Apache Arrow / ARROW-13413

[Python] IPC roundtrip fails in to_pandas with empty table and extension type


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.1
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

    Description

      With pyarrow 4.0.1 and pandas 1.2.3, writing an empty DataFrame with an extension dtype to an IPC stream and reading it back produces a table that `to_pandas` then fails to convert:

      import pandas as pd
      import pyarrow as pa
      
      df1 = pd.DataFrame({"x": pd.Series([], dtype="Int8")})
      tbl1 = pa.Table.from_pandas(df1)
      
      # In-memory roundtrip seems to work fine
      pa.Table.from_pandas(tbl1.to_pandas()).to_pandas()
      
      # Roundtrip the table through an IPC stream file on disk
      path = "/tmp/tmp.arr"
      writer = pa.RecordBatchStreamWriter(path, tbl1.schema)
      writer.write_table(tbl1)
      writer.close()
      reader = pa.RecordBatchStreamReader(path)
      tbl2 = reader.read_all()
      
      assert tbl1.schema.equals(tbl2.schema)
      assert tbl1.schema.metadata == tbl2.schema.metadata
      
      # The original in-memory table still converts fine
      df2 = tbl1.to_pandas()
      try:
          # The table read back from the stream fails to convert
          df2 = tbl2.to_pandas()
      except Exception as e:
          print(f"Error: {e}")
          # Workaround: dropping the schema metadata lets the conversion succeed
          df2 = tbl2.replace_schema_metadata(None).to_pandas()
      

      In the above example (with `Int8` as the pandas dtype), the table read back from disk cannot be converted to a DataFrame, even though its schema and metadata are supposedly equal to the original table's. Removing its metadata "fixes" the issue.

      The problem doesn't occur with "normal" (non-extension) dtypes, as shown in the sketch below. This may well be a bug in pandas, but it seems to hinge on the pandas-specific metadata Arrow attaches to the schema.
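
      For comparison, a minimal sketch of the same stream roundtrip with a plain NumPy-backed dtype (int8; the file path is just for illustration), which converts back without error:

      import pandas as pd
      import pyarrow as pa

      df = pd.DataFrame({"x": pd.Series([], dtype="int8")})  # non-extension dtype
      tbl = pa.Table.from_pandas(df)

      path = "/tmp/tmp_int8.arr"
      writer = pa.RecordBatchStreamWriter(path, tbl.schema)
      writer.write_table(tbl)
      writer.close()

      reader = pa.RecordBatchStreamReader(path)
      tbl_rt = reader.read_all()
      tbl_rt.to_pandas()  # no error for the plain int8 column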

      The full stacktrace:

      ---------------------------------------------------------------------------
      ValueError                                Traceback (most recent call last)
      <ipython-input-3-08855adb276d> in <module>
      ----> 1 df2 = tbl2.to_pandas()
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
          787     _check_data_column_metadata_consistency(all_columns)
          788     columns = _deserialize_column_index(table, all_columns, column_indexes)
      --> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
          790 
          791     axes = [columns, index]
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
         1128     result = pa.lib.table_to_blocks(options, block_table, categories,
         1129                                     list(extension_columns.keys()))
      -> 1130     return [_reconstruct_block(item, columns, extension_columns)
         1131             for item in result]
         1132 
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
         1128     result = pa.lib.table_to_blocks(options, block_table, categories,
         1129                                     list(extension_columns.keys()))
      -> 1130     return [_reconstruct_block(item, columns, extension_columns)
         1131             for item in result]
         1132 
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
          747             raise ValueError("This column does not support to be converted "
          748                              "to a pandas ExtensionArray")
      --> 749         pd_ext_arr = pandas_dtype.__from_arrow__(arr)
          750         block = _int.make_block(pd_ext_arr, placement=placement)
          751     else:
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __from_arrow__(self, array)
          119             results.append(int_arr)
          120 
      --> 121         return IntegerArray._concat_same_type(results)
          122 
          123 
      
      ~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/masked.py in _concat_same_type(cls, to_concat)
          269         cls: Type[BaseMaskedArrayT], to_concat: Sequence[BaseMaskedArrayT]
          270     ) -> BaseMaskedArrayT:
      --> 271         data = np.concatenate([x._data for x in to_concat])
          272         mask = np.concatenate([x._mask for x in to_concat])
          273         return cls(data, mask)
      
      <__array_function__ internals> in concatenate(*args, **kwargs)
      
      ValueError: need at least one array to concatenate
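
      The traceback shows pandas' `Int8Dtype.__from_arrow__` converting each chunk of the column and then concatenating the per-chunk results, so the error suggests the round-tripped column arrives with zero chunks (whereas the in-memory table has a single empty chunk). A minimal sketch of that suspected failure mode, under that assumption:

      import pandas as pd
      import pyarrow as pa

      # A ChunkedArray with no chunks at all (hypothesized state of tbl2's column)
      empty_chunked = pa.chunked_array([], type=pa.int8())
      print(empty_chunked.num_chunks)  # 0

      # pandas 1.2.3 converts chunk by chunk and concatenates; with no chunks this
      # ends up as np.concatenate([]) and raises the same ValueError as above
      pd.Int8Dtype().__from_arrow__(empty_chunked)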
      

    People

      Assignee: Unassigned
      Reporter: Thomas Buhrmann
