Apache Arrow / ARROW-5286

[Python] support Structs in Table.from_pandas given a known schema


Details

    Description

      ARROW-2073 added support for creating a StructArray from an array of tuples (in addition to dicts).
      This works with pyarrow.array when the struct type is passed explicitly:

      In [1]: import pandas as pd; import pyarrow as pa

      In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]})

      In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())])

      In [4]: pa.array(df['tuples'], type=struct_type)
      Out[4]: 
      <pyarrow.lib.StructArray object at 0x7f1b02ff6818>
      -- is_valid: all not null
      -- child 0 type: int64
        [
          1,
          3
        ]
      -- child 1 type: int64
        [
          2,
          4
        ]
      

      However, the same conversion does not yet work when converting a DataFrame to a Table with the struct type specified in a schema:

      In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
      ---------------------------------------------------------------------------
      KeyError                                  Traceback (most recent call last)
      ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_logical_type(arrow_type)
           68     try:
      ---> 69         return logical_type_map[arrow_type.id]
           70     except KeyError:
      
      KeyError: 24
      
      During handling of the above exception, another exception occurred:
      
      NotImplementedError                       Traceback (most recent call last)
      <ipython-input-5-c18748f9b954> in <module>
      ----> 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
      
      ~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
      
      ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
          483     metadata = construct_metadata(df, column_names, index_columns,
          484                                   index_descriptors, preserve_index,
      --> 485                                   types)
          486     return all_names, arrays, metadata
          487 
      
      ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, column_names, index_levels, index_descriptors, preserve_index, types)
          207         metadata = get_column_metadata(df[col_name], name=sanitized_name,
          208                                        arrow_type=arrow_type,
      --> 209                                        field_name=sanitized_name)
          210         column_metadata.append(metadata)
          211 
      
      ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_column_metadata(column, name, arrow_type, field_name)
          149     dict
          150     """
      --> 151     logical_type = get_logical_type(arrow_type)
          152 
          153     string_dtype, extra_metadata = get_extension_dtype_info(column)
      
      ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_logical_type(arrow_type)
           77         elif isinstance(arrow_type, pa.lib.Decimal128Type):
           78             return 'decimal'
      ---> 79         raise NotImplementedError(str(arrow_type))
           80 
           81 
      
      NotImplementedError: struct<a: int64, b: int64>
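
The two chained exceptions in the traceback come from a lookup-then-fallback pattern: the type id (24 for the struct type here) is missing from the map, the `KeyError` is caught, and no fallback branch matches, so `NotImplementedError` is raised. The following is a minimal pure-Python sketch of that pattern, not the pyarrow code itself. `LOGICAL_TYPE_MAP`, `FakeArrowType`, `get_logical_type_sketch`, and the `fallback` parameter are hypothetical names, and returning a generic name such as `"object"` is only one conceivable remedy.

```python
# Hypothetical, simplified stand-in for pandas_compat.get_logical_type.
# The ids and names in the map are illustrative; 24 is the struct type id
# reported by the KeyError in the traceback above.
LOGICAL_TYPE_MAP = {1: "bool", 9: "int64", 12: "float64"}


class FakeArrowType:
    """Minimal object exposing the two things the lookup uses."""

    def __init__(self, type_id, description):
        self.id = type_id
        self.description = description

    def __str__(self):
        return self.description


def get_logical_type_sketch(arrow_type, fallback=None):
    try:
        return LOGICAL_TYPE_MAP[arrow_type.id]
    except KeyError:
        # No direct mapping: either return a generic fallback name or fail,
        # which is what the traceback shows for struct types.
        if fallback is not None:
            return fallback
        raise NotImplementedError(str(arrow_type))


struct_like = FakeArrowType(24, "struct<a: int64, b: int64>")

try:
    get_logical_type_sketch(struct_like)
except NotImplementedError as exc:
    error_message = str(exc)
    # The KeyError is kept as the implicit context of the new exception,
    # hence "During handling of the above exception, ..." in the output.
    chained_from_key_error = isinstance(exc.__context__, KeyError)

# With a fallback, the metadata step would no longer abort the conversion.
logical_type_with_fallback = get_logical_type_sketch(struct_like, fallback="object")
```

Note that the failure happens while building the pandas metadata (`construct_metadata`), after `dataframe_to_arrays` has already produced the arrays, which suggests the array conversion itself is not the problem.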
      
      

            People

              jorisvandenbossche Joris Van den Bossche
