[ARROW-1941] Table <-> DataFrame roundtrip failing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 0.9.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/17931

Description

It is possible to create an Arrow table with a column that contains only empty lists and has an explicit type, e.g. list(string). In a roundtrip through pandas, however, the original type appears to be lost, and a subsequent attempt to convert the table to pandas then fails.

To reproduce in PyArrow 0.8.0:

import pyarrow as pa

# Create table with array of empty lists, forced to have type list(string)
arrays = {
    'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
    'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
}
rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
tbl = pa.Table.from_batches([rb])
print("Schema 1 (correct):\n{}".format(tbl.schema))

# First roundtrip changes schema
df = tbl.to_pandas()
tbl2 = pa.Table.from_pandas(df)
print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))

# Second roundtrip explodes
df2 = tbl2.to_pandas()

This results in the following output:

Schema 1 (correct):
c1: list<item: string>
  child 0, item: string
c2: list<item: string>
  child 0, item: string

Schema 2 (wrong):
c1: list<item: string>
  child 0, item: string
c2: list<item: null>
  child 0, item: null
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
            b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
            b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
            b' "object", "metadata": null}, {"name": null, "field_name": "__in'
            b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
            b'metadata": null}], "pandas_version": "0.21.1"}'}

...

> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null

That is, the array of empty lists of strings is converted into an array of lists of type null, and the pandas metadata records the column as list[float64].

If I replace the empty lists with None when creating the record batch, the roundtrip doesn't explode, but the column is silently converted to a plain column of type float in pandas (i.e. I lose the list type). This doesn't help, since other batches from the same source might contain non-empty lists and would end up with a different inferred schema, so they can't be concatenated into a single table.

(If this attempt at a double roundtrip seems weird: in my use case I receive data from a server in RecordBatches, which I convert to pandas for manipulation. I then serialize this data to disk using Arrow, and later need to read it back into pandas for further manipulation. So I need to be able to go through repeated rounds of table->df->table->df->table etc., where at any time a record batch may have columns that contain only empty lists.)

Issue Links

links to

GitHub Pull Request #1449

People

Assignee: Licht Takeuchi

Reporter: Thomas Buhrmann

Votes: 0

Watchers: 5

Dates

Created: 20/Dec/17 09:54

Updated: 11/Jan/23 07:17

Resolved: 02/Jan/18 16:43