Apache Arrow / ARROW-12889

[Python] compute.replace_substring_regex sometimes returns incorrect offsets, causing crashes/UB


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Environment: Ubuntu 20.04 or macOS Catalina running Docker Engine 20.10.2 and Python 3.8.6

    Description

      I've come across examples where calling `pyarrow.compute.replace_substring_regex` led to a segfault as soon as the result was used. After some experimentation, I found that the problem lies in the offsets buffer of the returned array.

      Here is a Dockerfile that reproduces the problem in a few lines (though without an immediate crash):

      FROM python:3.8
      RUN pip install pyarrow
      RUN echo "import pyarrow; \
          import pyarrow.compute; \
          options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
          values = [''] * 16; \
          arr = pyarrow.array(values, pyarrow.string()); \
          res = pyarrow.compute.replace_substring_regex(arr, options=options); \
          offsets = res.buffers()[1]; \
          assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
      RUN python /test.py
      

      The Docker image installs pyarrow (4.0.0 at the time of submitting this issue) and then runs Python code that creates an array of 16 empty strings and calls `replace_substring_regex` on it.
      The last 4 bytes of the offsets buffer (representing the last offset) are checked to be non-zero, which fails.

      Everything except the last offset looks fine: the validity buffer, the rest of the offsets, and the data buffer.

      I have more elaborate examples of arrays for which the last offset comes back as a random value, which causes crashes sooner than a plain 0 at the end does.
      Another hint that might help: the problem occurs at multiples of 16. Changing 16 to 32, 48, etc. still shows the problem, but other lengths do not.
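      A minimal sketch of how the offsets buffer could be sanity-checked without reading any string values (and so without risking a segfault). It relies only on the Arrow string layout: an array of n strings has n + 1 int32 offsets, they never decrease, and the last one cannot exceed the size of the data buffer. The helper name `check_string_offsets` is hypothetical, and little-endian 32-bit offsets (as used by `pyarrow.string()` on common platforms) are assumed. Which of these invariants 4.0.0 actually violates is not established here.

```python
import struct

INT32_SIZE = 4  # assumption: 32-bit offsets, as used by pyarrow.string()


def check_string_offsets(raw_offsets: bytes, length: int, data_size: int) -> list:
    """Return a list of problems found in a string array's offsets buffer.

    raw_offsets: raw bytes of the offsets buffer
    length:      number of strings in the array
    data_size:   size in bytes of the data buffer
    """
    needed = (length + 1) * INT32_SIZE
    if len(raw_offsets) < needed:
        # The last offset would have to be read past the end of the buffer.
        return [f"offsets buffer has {len(raw_offsets)} bytes, need {needed}"]

    # Decode the first length + 1 little-endian int32 values; any trailing
    # padding in the buffer is ignored.
    offsets = struct.unpack_from(f"<{length + 1}i", raw_offsets)

    problems = []
    if offsets[0] < 0:
        problems.append(f"first offset is negative ({offsets[0]})")
    for i in range(length):
        if offsets[i + 1] < offsets[i]:
            problems.append(
                f"offset {i + 1} ({offsets[i + 1]}) is smaller than "
                f"offset {i} ({offsets[i]})"
            )
    if offsets[-1] > data_size:
        problems.append(
            f"last offset ({offsets[-1]}) exceeds data buffer size ({data_size})"
        )
    return problems
```

      For an unsliced result `res`, the inputs could be taken from `res.buffers()[1].to_pybytes()`, `len(res)`, and `res.buffers()[2].size`. An empty list means none of the checked invariants is violated.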
       
      When I cloned the latest master, built Arrow, and ran the example, there was no problem. But since I didn't see the issue here on JIRA, I thought I should post it anyway. I'm not sure I built everything correctly, so my build may have introduced a problem of its own.

            People

              Assignee: Unassigned
              Reporter: Dror Speiser (drorspei)