Apache Arrow / ARROW-12774

[C++][Compute] replace_substring_regex() creates invalid arrays => crash

Details

    Description

      Minimal reproduction:

      import pyarrow as pa
      import pyarrow.compute  # ensures pa.compute is loaded
      arr = pa.array(['A'] * 16)
      arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
      arr2.validate(full=True)
      

      Expected results: a valid array
      Actual results: pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63
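The offset invariant named in the error can be illustrated without Arrow at all. The sketch below is a pure-Python approximation of the kind of check that `validate(full=True)` reports on. It is not Arrow's implementation; the helper name `find_offset_violation` and the exact message wording are assumptions made for illustration.

```python
from typing import Optional, Sequence


def find_offset_violation(offsets: Sequence[int], data_length: int) -> Optional[str]:
    """Return a description of the first broken invariant, or None if all hold.

    Hypothetical helper: a string array stores one offset per slot plus a
    final end offset, and the offsets must never decrease.
    """
    if offsets and offsets[0] < 0:
        return f"negative first offset: {offsets[0]}"
    for slot in range(1, len(offsets)):
        if offsets[slot] < offsets[slot - 1]:
            return (
                f"non-monotonic offset at slot {slot}: "
                f"{offsets[slot]} < {offsets[slot - 1]}"
            )
    if offsets and offsets[-1] > data_length:
        return f"last offset {offsets[-1]} exceeds data length {data_length}"
    return None


# Offsets for 16 one-byte strings: 0, 1, ..., 16.
valid_offsets = list(range(17))
# Same offsets, but with the final entry left at 0, as if it was never written.
broken_offsets = valid_offsets[:-1] + [0]

print(find_offset_violation(valid_offsets, 16))
print(find_offset_violation(broken_offsets, 16))
```

An array whose offsets fail this kind of check makes any consumer that computes a string length from two adjacent offsets see a negative length, which is consistent with the std::length_error reported here.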

      Operating on the invalid array can then crash the process. For example, arr.diff(arr2) aborts with something like:

      terminate called after throwing an instance of 'std::length_error'
        what():  basic_string::_S_create
      Aborted (core dumped)
      

      This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:

      def replace_substring_regex_workaround_12774(
          array: pa.Array,
          *,
          pattern: str,
          replacement: str
      ) -> pa.Array:
          """Call replace_substring_regex() without passing it a length that is a multiple of 16."""
          if len(array) > 0 and len(array) % 16 == 0:
              # Split into chunks of length 1 and len - 1; neither is a multiple of 16.
              chunked_array = pa.chunked_array([array.slice(0, 1), array.slice(1)], type=array.type)
              return pa.compute.replace_substring_regex(
                  chunked_array,
                  pattern=pattern,
                  replacement=replacement
              ).combine_chunks()
          else:
              return pa.compute.replace_substring_regex(
                  array,
                  pattern=pattern,
                  replacement=replacement
              )
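The multiple-of-16 observation could be checked with a small scan over input lengths. The following is a sketch only: `failing_lengths` and `pyarrow_output_is_valid` are hypothetical helper names, and pyarrow is imported lazily so the scanning logic can be exercised with a stand-in check on a machine without the affected build.

```python
from typing import Callable, Iterable, List


def failing_lengths(output_is_valid: Callable[[int], bool],
                    lengths: Iterable[int]) -> List[int]:
    """Return every length for which the check reports an invalid result."""
    return [n for n in lengths if not output_is_valid(n)]


def pyarrow_output_is_valid(n: int) -> bool:
    """Run the reproduction from this report on an array of n strings.

    Assumes pyarrow is installed; imported here so that the rest of the
    sketch does not depend on it.
    """
    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(["A"] * n)
    out = pc.replace_substring_regex(arr, pattern="X", replacement="Y")
    try:
        out.validate(full=True)
    except pa.ArrowInvalid:
        return False
    return True


def stand_in_check(n: int) -> bool:
    """Stand-in that mimics the behavior described in this report."""
    return n == 0 or n % 16 != 0


print(failing_lengths(stand_in_check, range(0, 50)))
```

On an affected build, calling `failing_lengths(pyarrow_output_is_valid, range(1, 100))` would be expected to list only multiples of 16 if the observation holds; on a fixed build it should return an empty list.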
      

            People

              niranda Niranda Perera
              adamhooper Adam Hooper
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m