Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
I'm trying to serialize a pandas DataFrame containing custom objects to parquet. Here is some example code:
    import pandas as pd
    import pyarrow as pa

    class Foo:
        pass

    df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
    table = pa.Table.from_pandas(df)
Gives me:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
      File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
        arrays = [convert_column(c, f)
      File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
        arrays = [convert_column(c, f)
      File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
        raise e
      File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
        result = pa.array(col, type=type_, from_pandas=True, safe=safe)
      File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
      File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
      File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
    pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')
Now, I realise that there's this disclaimer about arbitrary object serialization: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization. However, it isn't clear how this applies to Parquet. In my case, I want a well-formed Parquet file that holds binary blobs in one column, which can be deserialized to my class, but which general Parquet tools can otherwise read without failing. Using pickle doesn't solve this use case, since other languages like R may not be able to read the pickle file.
Alternatively, if there is a well-defined protocol for telling pyarrow how to translate a given type to and from Arrow types, I would be happy to use that instead.