Apache Arrow / ARROW-4814

[Python] Exception when writing nested columns that are tuples to parquet


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 0.12.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
    • Environment:
      4.20.8-100.fc28.x86_64

      Description

      I get an exception when I try to write a pandas.DataFrame to a Parquet file where one of the columns contains tuples. I use tuples here because they allow for easier querying in pandas (see ARROW-3806 for a more detailed description).

      Traceback (most recent call last):
        File "df_to_parquet_fail.py", line 5, in <module>
          df.to_parquet("test.parquet")  # crashes
        File "/home/user/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet                                                                                       
          partition_cols=partition_cols, **kwargs)
        File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet                                                                                        
          partition_cols=partition_cols, **kwargs)
        File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 113, in write                                                                                             
          table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
        File "pyarrow/table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
        File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 431, in dataframe_to_arrays                                                                           
          convert_types)]
        File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 430, in <listcomp>                                                                                    
          for c, t in zip(columns_to_convert,
        File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in convert_column                                                                                
          raise e
        File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 420, in convert_column                                                                                
          return pa.array(col, type=ty, from_pandas=True, safe=safe)
        File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
        File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
        File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: ("Could not convert ('G',) with type tuple: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column ALTS with type object')
      

      The issue may be replicated with the attached script and CSV file.

        Attachments

        1. df_to_parquet_fail.py
          0.2 kB
          Suvayu Ali
        2. test.csv
          0.1 kB
          Suvayu Ali

              People

              • Assignee: Unassigned
              • Reporter: Suvayu Ali (suvayu)
              • Votes: 0
              • Watchers: 3
