Apache Arrow / ARROW-15247

[Python] Convert array of Pandas dataframe to struct column

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Currently, converting a Pandas DataFrame that has a column of DataFrames to Arrow fails with "Could not convert <data> with type DataFrame: did not recognize Python value type when inferring an Arrow data type". We should be able to convert such a column to a List<Struct> array, similar to how the R bindings do it. The conversion could even be bidirectional, with to_pandas() parsing List<Struct> columns back into a column of DataFrames.

      Here is an example that currently fails:

      import pandas as pd
      import pyarrow as pa
      
      df1 = pd.DataFrame({
          'x': [1, 2, 3],
          'y': ['a', 'b', 'c']
      })
      
      df = pd.DataFrame({
          'df': [df1]*10
      })
      
      pa.Table.from_pandas(df)
      

      Here's what the other direction might look like for the same data:

      sub_tab = [{'x': 1, 'y': 'a'},
                 {'x': 2, 'y': 'b'},
                 {'x': 3, 'y': 'c'}]
      
      tab = pa.table({
          'df': pa.array([sub_tab]*10)
      })
      
      print(tab.schema)
      # df: list<item: struct<x: int64, y: string>>
      #    child 0, item: struct<x: int64, y: string>
      #       child 0, x: int64
      #       child 1, y: string
      
      tab.to_pandas()
      

          People

            Assignee: Unassigned
            Reporter: Will Jones (wjones127)