Apache Arrow / ARROW-1941

Table <-> DataFrame roundtrip failing

    Details

      Description

      It is possible to create an Arrow table with a column containing only empty lists that are given an explicit type (e.g. list of string). However, the original type appears to be lost in a roundtrip through pandas, and a subsequent attempt to convert the resulting table to pandas then fails.

      To reproduce in PyArrow 0.8.0:

      import pyarrow as pa
      
      # Create table with array of empty lists, forced to have type list(string)
      arrays = {
          'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
          'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
      }
      rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
      tbl = pa.Table.from_batches([rb])
      print("Schema 1 (correct):\n{}".format(tbl.schema))
      
      # First roundtrip changes schema
      df = tbl.to_pandas()
      tbl2 = pa.Table.from_pandas(df)
      print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
      
      # Second roundtrip explodes
      df2 = tbl2.to_pandas()
      

      This results in the following output:

      Schema 1 (correct):
      c1: list<item: string>
        child 0, item: string
      c2: list<item: string>
        child 0, item: string
      
      Schema 2 (wrong):
      c1: list<item: string>
        child 0, item: string
      c2: list<item: null>
        child 0, item: null
      __index_level_0__: int64
      metadata
      --------
      {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
                  b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
                  b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
                  b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
                  b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
                  b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
                  b' "object", "metadata": null}, {"name": null, "field_name": "__in'
                  b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
                  b'metadata": null}], "pandas_version": "0.21.1"}'}
      
      ...
      
      > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null
      

      That is, the array of empty lists of strings is converted into an array of lists of type null, and the pandas metadata records the column as list[float64].
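
      The loss of type can be illustrated with a toy model of value-based inference. This is only a sketch of the general idea, not PyArrow's actual implementation; `infer_list_item_type` is a hypothetical helper. When no list in the column contains an item, there is nothing to infer the item type from:

```python
# Toy model of value-based type inference; NOT PyArrow's actual implementation.
# infer_list_item_type is a hypothetical helper used only for illustration.

def infer_list_item_type(column):
    """Infer the item type of a list column from the values it contains."""
    seen = set()
    for row in column:
        if row is None:
            continue  # a null row carries no item information
        for item in row:
            if item is not None:
                seen.add(type(item).__name__)
    if not seen:
        return "null"  # nothing observed, so the item type cannot be recovered
    if len(seen) > 1:
        raise TypeError("mixed item types: {}".format(sorted(seen)))
    return seen.pop()


c1 = [["test"], ["a", "b"], None]  # same values as column c1 above
c2 = [[], [], []]                  # same values as column c2 above

c1_type = infer_list_item_type(c1)
c2_type = infer_list_item_type(c2)
print("c1: list<item: {}>".format(c1_type))
print("c2: list<item: {}>".format(c2_type))
```

      In this model, c1 keeps a concrete item type while c2 falls back to null, which mirrors the difference between Schema 1 and Schema 2.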

      If the empty lists are replaced with None when creating the record batch, the roundtrip doesn't explode, but the column is silently converted to a plain float column in pandas (i.e. the list type is lost). This doesn't help, since other batches from the same source might contain non-empty lists and would end up with a different inferred schema, so the batches can't be concatenated into a single table.

      (If this double roundtrip seems odd: in my use case I receive data from a server as RecordBatches, which I convert to pandas for manipulation. I then serialize the data to disk using Arrow and later need to read it back into pandas for further manipulation. So I need to be able to go through repeated rounds of table -> df -> table -> df -> table, where at any point a record batch may have columns that contain only empty lists.)

              People

              • Assignee: Licht Takeuchi (Licht-T)
              • Reporter: Thomas Buhrmann (buhrmann)