Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
When creating a ListArray from offsets and values in Python, the offsets are not validated: nothing checks that they start with 0 and end with the length of the values array. (But is that required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type says "The first value in the offsets array is 0, and the last element is the length of the values array.")
The array you get "seems" ok (judging by the repr), but on conversion to Python objects or to a flattened array, things go wrong:
In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))

In [62]: a
Out[62]:
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]:
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage
Calling validate manually correctly raises:
In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5
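For illustration, the offsets rule quoted from the layout docs can be sketched in pure Python. This is only a minimal sketch, not what pyarrow does internally: `check_list_offsets` is a hypothetical helper name, and the non-decreasing check is an extra assumption that the quoted sentence does not state.

```python
def check_list_offsets(offsets, values_length):
    """Hypothetical helper: raise ValueError if offsets break the quoted rules."""
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    # Rule from the quoted docs: the first offset is 0.
    if offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Assumption: offsets never decrease, so every list has length >= 0.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets must be non-decreasing: {cur} < {prev}")
    # Rule from the quoted docs: the last offset is the length of the values.
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset not equal to values length: {offsets[-1]}!={values_length}"
        )


# Well-formed offsets for 5 values pass silently.
check_list_offsets([0, 2, 5], 5)

# The offsets from the report are rejected before any values are read.
try:
    check_list_offsets([1, 3, 10], 5)
except ValueError as exc:
    error_message = str(exc)
```

With the offsets from the report, this sketch already fails on the first-offset rule, so it is stricter than the `validate()` output shown above, which only reports the final offset. Whether the first-offset rule is really required is the open question raised earlier.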
In C++, the main constructors are not safe: as the caller, you need to ensure that the data is correct, or call a safe (slower) constructor. But do we want to use the unsafe, fast constructors without validation as the default in Python as well? Or should we call validate here?
A quick search seems to indicate that `pa.Array.from_buffers` does validation, but the other `from_arrays` methods don't seem to do this explicitly.