[ARROW-15837] [C++][Python][Doc] ListArray.offsets is wrong when it contains both lists and null values - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 7.0.0
Fix Version/s: 8.0.0
Component/s: C++, Documentation, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/31277

Description

Hi! I noticed this bug when running the following code:

import pyarrow as pa

arr = pa.array([None, [0]])
reconstructed_arr = pa.ListArray.from_arrays(arr.offsets, arr.values)
print(reconstructed_arr.to_pylist())
# [[], [0]]

The resulting array, reconstructed from the offsets and values of the original array, is not the same as the original array: the null list comes back as an empty list.

This appears to happen because `arr.offsets` is wrong. It returns `[0, 0, 1]` instead of `[None, 0, 1]`:

print(arr.offsets.to_pylist())
# [0, 0, 1]

fixed_reconstructed_arr = pa.ListArray.from_arrays(pa.array([None, 0, 1]), arr.values)
print(fixed_reconstructed_arr.to_pylist())
# [None, [0]]
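The difference between the two offset arrays can be pictured with a small pure-Python model. This is only a sketch and does not call pyarrow: `lists_from_offsets` is a hypothetical helper, and the rule that a null offset borrows the next non-null offset is an assumption about how null offsets might be resolved, not something confirmed in this report.

```python
def lists_from_offsets(offsets, values):
    """Rebuild a list of lists from offsets and flat values.

    Hypothetical model: a None at offsets[i] marks list i as null.
    """
    n = len(offsets) - 1
    if offsets[-1] is None:
        raise ValueError("the last offset must not be null in this model")

    # Assumption: a null offset takes the value of the next non-null offset,
    # so every slot still has a well-defined start and end.
    cleaned = list(offsets)
    for i in range(n - 1, -1, -1):
        if cleaned[i] is None:
            cleaned[i] = cleaned[i + 1]

    result = []
    for i in range(n):
        if offsets[i] is None:
            result.append(None)  # null list
        else:
            result.append(values[cleaned[i]:cleaned[i + 1]])
    return result


print(lists_from_offsets([0, 0, 1], [0]))     # [[], [0]]
print(lists_from_offsets([None, 0, 1], [0]))  # [None, [0]]
```

With plain offsets `[0, 0, 1]` the first slot has zero length and so can only be read as an empty list. The information that it was null has to come from somewhere else, either a null offset or a separate validity bitmap.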

In case it helps, here is what I found while investigating:

The offsets seem to be wrong because they ignore the validity bitmap in `arr.buffers()[0]`, which marks which values are null and which are non-null. As a result, the `None` is replaced by `0`.

Separately from the offsets ignoring the validity bitmap, I checked the bitmap's value and it was not what I expected. The validity bitmap at `arr.buffers()[0]` is supposed to be `110` (in order to mask the None in `[None, 0, 1]`), but it is `10` for some reason:

bin(int(arr.buffers()[0].hex(), 16))
# '0b10'
# I think it should be 0b110: reading the bits in reverse order, 1 means non-null and 0 means null
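For reference, the Arrow columnar format numbers validity bits from the least significant bit and stores one bit per array element. The sketch below decodes a bitmap under that convention. `decode_validity` is a hypothetical helper, and the assumption that the bitmap covers the two lists rather than the three offsets is an interpretation, not something established in this report.

```python
def decode_validity(bitmap: bytes, length: int) -> list:
    """Decode an LSB-first validity bitmap into a list of booleans.

    Bit j of byte i describes element i * 8 + j; True means non-null.
    """
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# Observed bitmap 0b10, read as one bit per list (2 lists):
print(decode_validity(bytes([0b10]), 2))   # [False, True]

# Expected bitmap 0b110, read as one bit per offset (3 offsets):
print(decode_validity(bytes([0b110]), 3))  # [False, True, True]
```

Under the one-bit-per-list reading, `0b10` would mean "first list null, second list valid", which matches `[None, [0]]`. If that reading is right, the bitmap itself may be fine and only the offsets accessor drops the null information.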

Issue Links

links to

GitHub Pull Request #12557

People

Assignee: Antoine Pitrou

Reporter: quentin lhoest

Votes: 0

Watchers: 4

Dates

Created: 03/Mar/22 17:09

Updated: 11/Jan/23 11:40

Resolved: 08/Mar/22 11:04

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 50m