Details
- Bug
- Status: Closed
- Major
- Resolution: Duplicate
- 0.7.1
Description
Created from https://github.com/apache/arrow/issues/1208
Hi,
Not sure if this is related or the same as ARROW-1584, but I can't seem to find a way to handle arrays of lists which occasionally consist of empty lists only.
To reproduce:
import pyarrow as pa

na = []  # None, [""]
arrays = {
    'c1': pa.array([["test"], na, na], type=pa.list_(pa.string())),
    'c2': pa.array([na, na, na], type=pa.list_(pa.string())),
}
rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
df = rb.to_pandas()

pa.serialize_pandas(df)
# > ArrowNotImplementedError: Unable to convert type: null

tbl = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.RecordBatchFileWriter(sink, tbl.schema)
writer.write_table(tbl)
# > ArrowNotImplementedError: Unable to convert type: null
In my use case I'm processing data in batches where individual fields contain lists of strings. However, some batches may contain only empty lists, and there doesn't seem to be any way to represent that in Arrow at the moment.
Also, since I'm serializing the batches into a single file/stream, their schemas need to be consistent, which is why I tried explicitly specifying the type of the array as list_(string). The only workaround I've found is to replace empty lists with [""], but that implies lots of unnecessary glue code on the client side. Is there a better workaround until this is fixed in an official conda release?
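For reference, the glue code for the [""] workaround looks roughly like the sketch below. It is plain Python and doesn't touch pyarrow at all. The names pad_empty, unpad_empty and PLACEHOLDER are made up for illustration, and it assumes a list holding a single empty string never occurs as real data, since that value can't be told apart from a padded empty list after the round trip.

```python
from typing import List, Sequence

# Hypothetical sentinel; assumes [""] never appears as a genuine value.
PLACEHOLDER = ""


def pad_empty(column: Sequence[Sequence[str]]) -> List[List[str]]:
    """Replace each empty list with [""] so every entry holds a string."""
    return [list(values) if len(values) else [PLACEHOLDER] for values in column]


def unpad_empty(column: Sequence[Sequence[str]]) -> List[List[str]]:
    """Undo pad_empty after the data has been read back."""
    return [[] if list(values) == [PLACEHOLDER] else list(values) for values in column]


# Same shape as the reproduction: c2 contains empty lists only.
batch = {"c1": [["test"], [], []], "c2": [[], [], []]}
padded = {name: pad_empty(col) for name, col in batch.items()}
restored = {name: unpad_empty(col) for name, col in padded.items()}
```

The padded columns would be what gets handed to Arrow, and unpad_empty would run on the reader side, which is exactly the extra client-side step I'd like to avoid.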