ARROW-12670

[C++] extract_regex gives bizarre behavior after nulls or non-matches

Details

    Description

      After a non-match, a subsequent string may still match, but its captured value lands in the wrong array element. In the output below, "c" appears at index 3 instead of index 2, and "d" is missing.

      >>> pa.compute.extract_regex(pa.array(["a", "b", "c", "d"]), pattern="(?P<x>[^b])")
      <pyarrow.lib.StructArray object at 0x7f80de918ee0>
      -- is_valid:
        [
          true,
          false,
          true,
          true
        ]
      -- child 0 type: string
        [
          "a",
          "",
          "",
          "c"
        ]
      

      The same happens when matching after a null:

      >>> pa.compute.extract_regex(pa.array(["a", None, "c", "d", "e"]), pattern="(?P<x>[^b])")
      <pyarrow.lib.StructArray object at 0x7f80de918ee0>
      -- is_valid:
        [
          true,
          false,
          true,
          true,
          true
        ]
      -- child 0 type: string
        [
          "a",
          "",
          "",
          "c",
          "d"
        ]
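
For comparison, the result one would expect from both calls can be sketched with Python's standard `re` module. This is only a reference model, not Arrow's implementation: `expected_extract` is a hypothetical helper, and it assumes search (partial-match) semantics and that `re` and RE2 agree on this pattern. For the single-character inputs above, search and full-match give the same answer.

```python
import re
from typing import Dict, List, Optional


def expected_extract(
    values: List[Optional[str]], pattern: str
) -> List[Optional[Dict[str, str]]]:
    """Hypothetical reference: one result per input, None for null or non-match."""
    regex = re.compile(pattern)
    results: List[Optional[Dict[str, str]]] = []
    for value in values:
        # A null input and a failed search both yield a null struct.
        match = regex.search(value) if value is not None else None
        results.append(match.groupdict() if match is not None else None)
    return results


print(expected_extract(["a", "b", "c", "d"], "(?P<x>[^b])"))
print(expected_extract(["a", None, "c", "d", "e"], "(?P<x>[^b])"))
```

Under this model every capture stays at the index of its input string, which is what the outputs above fail to do.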
      

      Workaround: 1) filter out non-matches; 2) extract only the matching strings; 3) reinsert nulls at the positions of the nulls and non-matches:

      import numpy as np
      import pyarrow as pa
      import pyarrow.compute  # makes pa.compute available


      def _extract_regex_workaround_arrow_12670(
          array: pa.StringArray, *, pattern: str
      ) -> pa.StructArray:
          ok = pa.compute.match_substring_regex(array, pattern=pattern)
          good = array.filter(ok)
          good_matches = pa.compute.extract_regex(good, pattern=pattern)
      
          # Build take indices that look like [None, 0, None, 1, 2, 3, None, 4]
          # ... ok_nonnull: [False, True, False, True, True, True, False, True]
          # (not ok.fill_null(False).cast(pa.int8()) because of ARROW-12672 segfault)
          ok_nonnull = pa.compute.and_kleene(ok.is_valid(), ok)
          # ... np_ok: [0, 1, 0, 1, 1, 1, 0, 1]
          np_ok = ok_nonnull.cast(pa.int8()).to_numpy(zero_copy_only=False)
          # ... np_index: [-1, 0, 0, 1, 2, 3, 3, 4] (running match count, 0-based)
          np_index = np.cumsum(np_ok, dtype=np.int64) - 1
          # ... index_or_null: [None, 0, None, 1, 2, 3, None, 4]
          # Reuse ok_nonnull's data bitmap as the validity bitmap of the index array
          valid = ok_nonnull.buffers()[1]
          index_or_null = pa.Array.from_buffers(
              pa.int64(), len(array), [valid, pa.py_buffer(np_index)]
          )
      
          return good_matches.take(index_or_null)
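
The index-building step of the workaround can be traced in plain Python, without Arrow or NumPy. This is a minimal sketch: `build_take_indices` and `take_with_nulls` are hypothetical names, lists stand in for Arrow arrays, and the `m0`..`m4` strings are placeholder values for the rows of `good_matches`.

```python
from itertools import accumulate
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_take_indices(ok: Sequence[Optional[bool]]) -> List[Optional[int]]:
    """Map each input row to its row in the filtered array, or None."""
    # Treat null like False, mirroring and_kleene(ok.is_valid(), ok).
    flags = [1 if flag else 0 for flag in ok]
    # Running count of matches so far, shifted to 0-based positions.
    positions = [count - 1 for count in accumulate(flags)]
    # Rows that did not match get a null index instead of a position.
    return [pos if flag else None for pos, flag in zip(positions, flags)]


def take_with_nulls(
    values: Sequence[T], indices: Sequence[Optional[int]]
) -> List[Optional[T]]:
    """Pick values by index; a None index produces a None output."""
    return [None if i is None else values[i] for i in indices]


ok = [False, True, None, True, True, True, False, True]
indices = build_take_indices(ok)
print(indices)  # [None, 0, None, 1, 2, 3, None, 4]
print(take_with_nulls(["m0", "m1", "m2", "m3", "m4"], indices))
```

The cumulative sum counts how many matches precede each row, so subtracting one gives the row's position in the filtered array; masking the non-matching rows is what puts the nulls back where they were.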
      

            People

              apitrou Antoine Pitrou
              adamhooper Adam Hooper

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h