Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-15826

[Python] Allow serializing arbitrary Python objects to parquet

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Parquet, Python
    • None

    Description

      I'm trying to serialize a pandas DataFrame containing custom objects to parquet. Here is some example code:

      import pandas as pd
      import pyarrow as pa
      
      class Foo: 
          pass
      
      df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
      table = pyarrow.Table.from_pandas(df)
      

      Gives me:

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
        File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
          arrays = [convert_column(c, f)
        File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
          arrays = [convert_column(c, f)
        File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
          raise e
        File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
          result = pa.array(col, type=type_, from_pandas=True, safe=safe)
        File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
        File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')

      Now, I realise that there's this disclaimer about arbitrary object serialization: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization. However it isn't clear how this applies to parquet. In my case, I want to have a well-formed parquet file that has binary blobs in one column that can be deserialized to my class, but can otherwise be read by general parquet tools without failing. Using pickle doesn't solve this use case since other languages like R may not be able to read the pickle file.

      Alternatively, if there is a well-defined protocol for telling pyarrow how to translate a given type to and from arrow types, I would be happy to use that instead.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            multimeric Michael Milton
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: