Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 3.0.0, 4.0.0, 5.0.0
Description
I'm trying to explode a table (in the pandas sense: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html).
As it's not yet supported, I've written some code to do it using a mix of list_flatten and list_parent_indices. It works well, except that it crashes on empty tables:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0730 15:16:05.164858 13612 chunked_array.cc:48] Check failed: (chunks_.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
*** Check failure stack trace: ***
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
Here's a reproducible example:
import sys
import pyarrow as pa
from pyarrow import compute
import pandas as pd

table = pa.Table.from_arrays(
    [
        pa.array([101, 102, 103], pa.int32()),
        pa.array([['a'], ['a', 'b'], ['a', 'b', 'c']], pa.list_(pa.string()))
    ],
    names=['key', 'list']
)

def explode(table) -> pa.Table:
    exploded_list = compute.list_flatten(table['list'])
    indices = compute.list_parent_indices(table['list'])
    assert indices.type == pa.int32()
    keys = compute.take(table['key'], indices)  # <--- Crashes here
    return pa.Table.from_arrays(
        [keys, exploded_list],
        names=['key', 'list_element']
    )

explode(table).to_pandas().to_markdown(sys.stdout)
explode(table.slice(0, 0)).to_pandas().to_markdown(sys.stdout)  # <--- doesn't work
I've narrowed it down to the following:
When list_parent_indices is called on an empty table, it returns this empty chunked array (zero chunks):
pa.chunked_array([], pa.int32())
Instead of this chunked array with 1 empty chunk:
pa.chunked_array([pa.array([], pa.int32())])
In turn, take doesn't work with the empty chunked array:
compute.take(pa.chunked_array([pa.array([], pa.int32())]), pa.chunked_array([], pa.int32()))  # Bad
compute.take(pa.chunked_array([pa.array([], pa.int32())]), pa.chunked_array([pa.array([], pa.int32())]))  # Good
In terms of how to fix it, there are two possible solutions:
- take could accept empty chunked array
- list_parent_indices could return a chunked array with an empty chunk
PS: the error message isn't accurate. It says "cannot construct ChunkedArray from empty vector and omitted type", but the array being passed has a type (int32), just no chunks. This makes me suspect that something in take strips the type from the empty chunked array.