Apache Arrow / ARROW-13474

[C++][Python] PyArrow crash when filter/take empty Extension array

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 4.0.0, 4.0.1
    • Fix Version/s: 6.0.0
    • Component/s: C++, Python
    • Environment: Python 3.7, Ubuntu 20.04

    Description

      PyArrow crashes when `filter` or `take` is applied to an extension array that is already empty.

      The bug can be reproduced with the documentation example:

      import pyarrow as pa
      
      class Point3DArray(pa.ExtensionArray):
          def to_numpy_array(self):
              return self.storage.flatten().to_numpy().reshape((-1, 3))
      
      
      class Point3DType(pa.PyExtensionType):
          def __init__(self):
              pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))
      
          def __reduce__(self):
              return Point3DType, ()
      
          def __arrow_ext_class__(self):
              return Point3DArray
      
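      # Build a two-element extension array, then filter out every element
      # to obtain an empty extension array.
      storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
      arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
      arr = arr.filter(pa.array([False, False]))
      
      # Filtering the already-empty array crashes here...
      arr.filter(pa.array([], pa.bool_()))
      # ...and so does taking from it.
      arr.take(pa.array([], pa.int32()))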

      The underlying issue seems to be that the function `nulls` is not implemented for extension types in the C++ codebase: https://github.com/apache/arrow/blob/6db88a9e946c98c59f179210a70bc05ef6a0a296/cpp/src/arrow/array/util.cc#L472

            People

              Assignee: jorisvandenbossche Joris Van den Bossche
              Reporter: balancap Paul Balanca
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m