Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 4.0.0
Description
After a non-match, a subsequent string may match, but its extracted data ends up in the wrong array element: the child array is shifted by one position for each preceding non-match.
>>> pa.compute.extract_regex(pa.array(["a", "b", "c", "d"]), pattern="(?P<x>[^b])")
<pyarrow.lib.StructArray object at 0x7f80de918ee0>
-- is_valid: [ true, false, true, true ]
-- child 0 type: string
  [ "a", "", "", "c" ]
The same happens when matching after a null:
>>> pa.compute.extract_regex(pa.array(["a", None, "c", "d", "e"]), pattern="(?P<x>[^b])")
<pyarrow.lib.StructArray object at 0x7f80de918ee0>
-- is_valid: [ true, false, true, true, true ]
-- child 0 type: string
  [ "a", "", "", "c", "d" ]
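For comparison, the expected alignment can be sketched with Python's standard `re` module. This is only a reference model, not pyarrow code: `expected_extract` is a hypothetical helper, and it assumes that a null input or a non-match yields a null entry and that the pattern may match anywhere in the string.

```python
import re


def expected_extract(values, pattern):
    """Reference model: one result per input, None for a null or a non-match."""
    compiled = re.compile(pattern)
    results = []
    for value in values:
        # Null inputs are never searched; they map straight to None.
        match = compiled.search(value) if value is not None else None
        results.append(match.groupdict() if match else None)
    return results


print(expected_extract(["a", "b", "c", "d"], "(?P<x>[^b])"))
# [{'x': 'a'}, None, {'x': 'c'}, {'x': 'd'}]
print(expected_extract(["a", None, "c", "d", "e"], "(?P<x>[^b])"))
# [{'x': 'a'}, None, {'x': 'c'}, {'x': 'd'}, {'x': 'e'}]
```

In both cases the value extracted for "c" should sit at index 2, whereas the output above shows it at index 3.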
Workaround: 1) filter out non-matches; 2) extract only from the matching strings; 3) re-insert nulls at the positions of the non-matches:
import numpy as np
import pyarrow as pa
import pyarrow.compute  # ensures pa.compute is loaded


def _extract_regex_workaround_arrow_12670(
    array: pa.StringArray, *, pattern: str
) -> pa.StructArray:
    # Which inputs match (null where the input is null)
    ok = pa.compute.match_substring_regex(array, pattern=pattern)
    good = array.filter(ok)
    good_matches = pa.compute.extract_regex(good, pattern=pattern)

    # Build array that looks like [None, 0, None, 1, 2, 3, None, 4]
    # ... ok_nonnull: [False, True, False, True, True, True, False, True]
    # (not ok.fill_null(False).cast(pa.int8()) because of ARROW-12672 segfault)
    ok_nonnull = pa.compute.and_kleene(ok.is_valid(), ok)
    # ... np_ok: [0, 1, 0, 1, 1, 1, 0, 1]
    np_ok = ok_nonnull.cast(pa.int8()).to_numpy(zero_copy_only=False)
    # ... np_index: [-1, 0, 0, 1, 2, 3, 3, 4]
    np_index = np.cumsum(np_ok, dtype=np.int64) - 1
    # ... index_or_null: [None, 0, None, 1, 2, 3, None, 4]
    # (the data bitmap of ok_nonnull is reused as the validity bitmap)
    valid = ok_nonnull.buffers()[1]
    index_or_null = pa.Array.from_buffers(
        pa.int64(), len(array), [valid, pa.py_buffer(np_index)]
    )
    return good_matches.take(index_or_null)
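The index arithmetic in the workaround may be easier to follow without Arrow buffers. The following is a minimal pure-Python sketch of the same idea: a running count of matches, minus one, gives each matching row its position in the filtered array, and non-matching rows get None. `take_indices` is a hypothetical name used only for illustration.

```python
from itertools import accumulate


def take_indices(ok):
    """Map each input row to its row in the filtered array, or None."""
    # A null (None) match result counts as a non-match.
    flags = [1 if flag else 0 for flag in ok]
    # Number of matches seen up to and including each row.
    running = list(accumulate(flags))
    return [count - 1 if flag else None for flag, count in zip(flags, running)]


ok = [False, True, None, True, True, True, False, True]
print(take_indices(ok))  # [None, 0, None, 1, 2, 3, None, 4]
```

Taking the filtered results at these indices places every match back at its original row and leaves a null wherever the input was null or did not match.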