Details
Improvement
Status: Open
Major
Resolution: Unresolved
6.0.1
None
None
Description
Description
Currently, converting a pandas DataFrame that has a column of DataFrames to Arrow fails with "Could not convert <data> with type DataFrame: did not recognize Python value type when inferring an Arrow data type". We should be able to convert such a column to a List<Struct> array, similar to how the R bindings do it. This could even be bi-directional, with to_pandas() parsing the structs back into a column of DataFrames.
Here is an example that currently fails:
import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c']
})
df = pd.DataFrame({
    'df': [df1]*10
})
pa.Table.from_pandas(df)
Here's what the other direction might look like for the same data:
sub_tab = [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}, {'x': 3, 'y': 'c'}]
tab = pa.table({
    'df': pa.array([sub_tab]*10)
})
print(tab.schema)
# df: list<item: struct<x: int64, y: string>>
#   child 0, item: struct<x: int64, y: string>
#       child 0, x: int64
#       child 1, y: string
tab.to_pandas()