  1. Apache Arrow
  2. ARROW-15837

[C++][Python][Doc] ListArray.offsets is wrong when it contains both lists and null values

Details

    Description

      Hi! I noticed this bug by running this code:

      import pyarrow as pa
      
      arr = pa.array([None, [0]])
      reconstructed_arr = pa.ListArray.from_arrays(arr.offsets, arr.values)
      print(reconstructed_arr.to_pylist())
      # [[], [0]] 

      The resulting array, reconstructed from the offsets and values of the original array, is not the same as the original array.

      This seems to happen because `arr.offsets` is wrong: it returns `[0, 0, 1]` instead of `[None, 0, 1]`:

      print(arr.offsets.to_pylist())
      # [0, 0, 1]
      
      fixed_reconstructed_arr = pa.ListArray.from_arrays(pa.array([None, 0, 1]), arr.values)
      print(fixed_reconstructed_arr.to_pylist())
      # [None, [0]]

      In case it helps, here is my investigation:

      The offsets seem to be wrong because they don't include the validity bitmap from `arr.buffers()[0]`, which marks which values are null and which are non-null. As a result, the `None` is replaced by `0`.

      Even setting aside the fact that the validity bitmap is not applied to the offsets, I checked its value and it was not what I expected: the validity bitmap at `arr.buffers()[0]` is supposed to be `110` (in order to mask the None in `[None, 0, 1]`), but it is `10` for some reason:
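
To make the role of the validity bitmap concrete, here is a toy model in plain Python of how a list array is read back from its offsets and values. It is only a sketch under stated assumptions: `lists_from_layout` is a hypothetical helper written for illustration, not a pyarrow function, and it takes validity as a list of booleans rather than a packed bitmap.

```python
from typing import Optional


def lists_from_layout(
    offsets: list[int],
    values: list[int],
    validity: Optional[list[bool]] = None,
) -> list[Optional[list[int]]]:
    """Rebuild Python lists from offsets and values (hypothetical helper).

    Element i spans values[offsets[i]:offsets[i + 1]]. If a validity list
    is given, entries marked False are returned as None.
    """
    result: list[Optional[list[int]]] = []
    for i in range(len(offsets) - 1):
        if validity is not None and not validity[i]:
            result.append(None)
        else:
            result.append(values[offsets[i]:offsets[i + 1]])
    return result


# Offsets alone cannot tell a null entry from an empty list.
without_validity = lists_from_layout([0, 0, 1], [0])

# With the validity information, the null entry is recovered.
with_validity = lists_from_layout([0, 0, 1], [0], [False, True])
```

In this model, the same offsets `[0, 0, 1]` yield `[[], [0]]` without validity and `[None, [0]]` with it, which matches the difference between the two reconstructions shown above.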

      bin(int(arr.buffers()[0].hex(), 16))
      # '0b10'
      # I think it should be 0b110: 1 corresponds to non-null and 0 corresponds to null, if you take the bits in reverse order
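
The bit order mentioned in the comment above can be checked with a small decoder. This is a minimal sketch assuming the Arrow convention that validity bitmaps use least-significant-bit numbering (entry `i` is bit `i % 8` of byte `i // 8`); `decode_validity` is a hypothetical helper, not part of pyarrow.

```python
def decode_validity(bitmap: bytes, length: int) -> list[bool]:
    """Decode the first `length` entries of an LSB-ordered validity bitmap.

    True means non-null, False means null (hypothetical helper).
    """
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# The observed bitmap 0b10, read for the 2 entries of the list array.
observed = decode_validity(bytes([0b10]), 2)

# The bitmap 0b110, read for 3 entries (one per offset).
expected_by_reporter = decode_validity(bytes([0b110]), 3)
```

Under that assumption, `0b10` read over two entries gives `[False, True]`, and `0b110` read over three entries gives `[False, True, True]`. The two differ in how many entries the bitmap is taken to cover: one per list element, or one per offset.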

            People

              apitrou Antoine Pitrou
              lhoestq quentin lhoest
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m