Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
4.0.0
Description
Minimal reproduction:
import pyarrow as pa
import pyarrow.compute  # makes pa.compute available

arr = pa.array(['A'] * 16)
arr2 = pa.compute.replace_substring_regex(arr, pattern="X", replacement="Y")
arr2.validate(full=True)
Expected results: a valid array
Actual results: pyarrow.lib.ArrowInvalid: Offset invariant failure: non-monotonic offset at slot 64: 0 < 63
So if you run arr.diff(arr2), you'll get something like:
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
Aborted (core dumped)
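For context on the "Offset invariant failure" reported above: in Arrow's variable-length string layout, slot k's bytes are delimited by two consecutive entries of an offsets buffer, so the offsets must never decrease. The following is a minimal pure-Python sketch of that invariant, not pyarrow's actual validation code; the function name `first_non_monotonic_slot` is hypothetical.

```python
from typing import Optional, Sequence


def first_non_monotonic_slot(offsets: Sequence[int]) -> Optional[int]:
    """Return the first index i where offsets[i] < offsets[i - 1], else None."""
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            return i
    return None


# Offsets for three one-byte strings: value k spans data[offsets[k]:offsets[k + 1]].
good = [0, 1, 2, 3]
# A stray zero after a larger offset breaks the invariant.
bad = [0, 1, 2, 0]

print(first_non_monotonic_slot(good))  # None
print(first_non_monotonic_slot(bad))   # 3
```

An array whose offsets fail this kind of check can make later operations read string lengths that are negative or far too large, which is one plausible reading of the `std::length_error` crash shown above.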
This seems to happen if and only if the input array length is a multiple of 16. That leads to an ugly workaround:
def replace_substring_regex_workaround_12774(
    array: pa.Array, *, pattern: str, replacement: str
) -> pa.Array:
    if len(array) > 0 and len(array) % 16 == 0:
        chunked_array = pa.chunked_array(
            [array.slice(0, 1), array.slice(1)], type=array.type
        )
        return pa.compute.replace_substring_regex(
            chunked_array, pattern=pattern, replacement=replacement
        ).combine_chunks()
    else:
        return pa.compute.replace_substring_regex(
            array, pattern=pattern, replacement=replacement
        )
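The "multiple of 16" observation could be checked by scanning a range of lengths and recording which ones produce an invalid array. This is a rough sketch; the helper names `find_failing_lengths` and `pyarrow_output_is_valid` are hypothetical, and pyarrow is imported lazily so the scanning logic can be exercised without it.

```python
from typing import Callable, List


def find_failing_lengths(is_valid: Callable[[int], bool], max_len: int) -> List[int]:
    """Return every length in 1..max_len for which is_valid(length) is False."""
    return [n for n in range(1, max_len + 1) if not is_valid(n)]


def pyarrow_output_is_valid(length: int) -> bool:
    """Run the reproduction from this report on an array of `length` strings."""
    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(["A"] * length)
    out = pc.replace_substring_regex(arr, pattern="X", replacement="Y")
    try:
        out.validate(full=True)
    except pa.ArrowInvalid:
        return False
    return True
```

Calling `find_failing_lengths(pyarrow_output_is_valid, 64)` on an affected build would be expected, per the observation above, to return only multiples of 16; on a build that includes the fix it should return an empty list.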
Issue Links
- supersedes: ARROW-12889 [Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/ub (Closed)
- links to