Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 4.0.1
- Fix Version/s: None
- Component/s: None
Description
With pyarrow=4.0.1 and pandas=1.2.3, writing an empty DataFrame with an extension dtype to an IPC stream and reading it back produces a table that `to_pandas` subsequently fails to convert:

import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({"x": pd.Series([], dtype="Int8")})
tbl1 = pa.Table.from_pandas(df1)

# In-memory roundtrip seems to work fine
pa.Table.from_pandas(tbl1.to_pandas()).to_pandas()

path = "/tmp/tmp.arr"
writer = pa.RecordBatchStreamWriter(path, tbl1.schema)
writer.write_table(tbl1)
writer.close()

reader = pa.RecordBatchStreamReader(path)
tbl2 = reader.read_all()

assert tbl1.schema.equals(tbl2.schema)
assert tbl1.schema.metadata == tbl2.schema.metadata

df2 = tbl1.to_pandas()  # works
try:
    df2 = tbl2.to_pandas()  # fails
except Exception as e:
    print(f"Error: {e}")
    df2 = tbl2.replace_schema_metadata(None).to_pandas()  # works
In the above example (with `Int8` as the pandas dtype), the table read from disk cannot be converted to a DataFrame, even though its schema and metadata are supposedly equal to the original table. Removing its metadata "fixes" the issue.
The problem doesn't occur with "normal" (non-extension) dtypes. This may well be a bug in pandas, but it seems to be triggered by something in Arrow's schema metadata.
The full stacktrace:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-08855adb276d> in <module>
----> 1 df2 = tbl2.to_pandas()

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    787     _check_data_column_metadata_consistency(all_columns)
    788     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    790
    791     axes = [columns, index]

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1128     result = pa.lib.table_to_blocks(options, block_table, categories,
   1129                                     list(extension_columns.keys()))
-> 1130     return [_reconstruct_block(item, columns, extension_columns)
   1131             for item in result]
   1132

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
   1128     result = pa.lib.table_to_blocks(options, block_table, categories,
   1129                                     list(extension_columns.keys()))
-> 1130     return [_reconstruct_block(item, columns, extension_columns)
   1131             for item in result]
   1132

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _reconstruct_block(item, columns, extension_columns)
    747             raise ValueError("This column does not support to be converted "
    748                              "to a pandas ExtensionArray")
--> 749         pd_ext_arr = pandas_dtype.__from_arrow__(arr)
    750         block = _int.make_block(pd_ext_arr, placement=placement)
    751     else:

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __from_arrow__(self, array)
    119             results.append(int_arr)
    120
--> 121         return IntegerArray._concat_same_type(results)
    122
    123

~/miniforge3/envs/grapy/lib/python3.8/site-packages/pandas/core/arrays/masked.py in _concat_same_type(cls, to_concat)
    269         cls: Type[BaseMaskedArrayT], to_concat: Sequence[BaseMaskedArrayT]
    270     ) -> BaseMaskedArrayT:
--> 271         data = np.concatenate([x._data for x in to_concat])
    272         mask = np.concatenate([x._mask for x in to_concat])
    273         return cls(data, mask)

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate