  1. Apache Arrow
  2. ARROW-6132

[Python] ListArray.from_arrays does not check validity of input arrays

Details

    Description

      From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.

      When creating a ListArray from offsets and values in Python, the offsets are not validated: nothing checks that they start with 0 and end with the length of the values array. Is that required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.").

      The array you get "seems" OK judging by its repr, but things go wrong when converting it to Python objects or flattening it:

      In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
      
      In [62]: a
      Out[62]: 
      <pyarrow.lib.ListArray object at 0x7fdd9c468678>
      [
        [
          1,
          2
        ],
        [
          3,
          4
        ]
      ]
      
      In [63]: a.flatten()
      Out[63]: 
      <pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
      [
        0,   # <--- includes the 0
        1,
        2,
        3,
        4
      ]
      
      In [64]: a.to_pylist()
      Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage
      

      Calling validate manually correctly raises:

      In [65]: a.validate()
      ...
      ArrowInvalid: Final offset invariant not equal to values length: 10!=5
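
      The invariant quoted from the layout docs can also be checked before the array is built. The following is a minimal pure-Python sketch, not pyarrow's implementation: `check_list_offsets` and its `require_zero_start` flag are hypothetical names introduced here for illustration, and the rules it enforces are only those stated in the quoted sentence plus the assumption that offsets never decrease.

```python
def check_list_offsets(offsets, values_length, require_zero_start=True):
    """Raise ValueError if offsets break the list-layout invariant quoted above.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    offsets = list(offsets)
    if not offsets:
        raise ValueError("offsets must contain at least one element")
    # "The first value in the offsets array is 0" (made optional here).
    if require_zero_start and offsets[0] != 0:
        raise ValueError(f"first offset must be 0, got {offsets[0]}")
    # Assumption: each list ends at or after the position where it starts.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets must not decrease: {prev} > {cur}")
    # "the last element is the length of the values array"
    if offsets[-1] != values_length:
        raise ValueError(
            f"final offset {offsets[-1]} != values length {values_length}"
        )


# Offsets consistent with a values array of length 5 pass silently.
check_list_offsets([0, 2, 5], 5)

# The offsets from the session above are rejected.
try:
    check_list_offsets([1, 3, 10], 5)
except ValueError as exc:
    print("rejected:", exc)
```

      With `require_zero_start=False` the same input is still rejected, this time because the final offset (10) does not match the values length (5), which is the condition `validate()` reports above.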
      

      In C++ the main constructors are not safe: as the caller you need to ensure that the data is correct, or call a safe (slower) constructor. But do we want Python to use the unsafe, fast constructors without validation by default as well? Or should we call validate here?

      A quick search seems to indicate that `pa.Array.from_buffers` does validation, but the other `from_arrays` methods don't seem to do this explicitly.

            People

              jorisvandenbossche Joris Van den Bossche
