Apache Arrow / ARROW-7435

Security issue: ValidateOffsets() does not prevent buffer over-read

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.1, 0.16.0
    • Fix Version/s: 0.16.0
    • Component/s: C++, Python
    • Environment: Docker

      Description

      Skimming through Validate() code in both 0.15 and master, I noticed an oversight in BinaryArray validation in C++ (and Python).

      ValidateOffsets() checks that the first offset is 0, but it doesn't check that the offsets all point within the data buffer. A malicious Arrow file could contain offsets=[0, 999999] and data=[]. If a caller then reads the first value in that array, the read runs past the end of the data buffer: a buffer over-read.

      The additional check is cheap, since Arrow already validates that offsets are monotonically increasing: one need only test that the last offset is less than or equal to the size of the data buffer.
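
      To make the proposed check concrete, here is a minimal sketch in plain Python. It is not Arrow's implementation: the function name offsets_are_valid and its parameters are hypothetical, offsets is modelled as a list of integers, and data_size is the length of the data buffer in bytes. The sketch treats "monotonically increasing" as non-decreasing (so empty values are allowed), which is an assumption on my part.

```python
from typing import Sequence


def offsets_are_valid(offsets: Sequence[int], data_size: int) -> bool:
    """Hypothetical helper: True if no offset can point past the data buffer.

    Illustrative sketch only, not Arrow's ValidateOffsets().
    """
    if len(offsets) == 0:
        # No offsets means no values, so nothing can be read out of bounds.
        return True

    # Check described in the issue as already present: first offset is 0.
    if offsets[0] != 0:
        return False

    # Check described in the issue as already present: offsets never decrease.
    # (Assumption: equal neighbours are allowed, representing empty values.)
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            return False

    # The missing check: since offsets never decrease, the last one is the
    # largest, so comparing it with the buffer size bounds every offset.
    return offsets[-1] <= data_size


# The example from the description: offsets=[0, 999999] with an empty data buffer.
print(offsets_are_valid([0, 999999], 0))
# A well-formed array holding two values, b"abc" and b"de".
print(offsets_are_valid([0, 3, 5], len(b"abcde")))
```

      The last comparison is a single integer test on top of the checks that already run, which is why the added cost should be negligible.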

      We at Workbench let untrusted programs write Arrow files, which we then validate and read. We want to make sure untrusted programs cannot use Arrow files to plant data that leads to arbitrary code execution or arbitrary reads. We wrote a validation tool that checks for the buffer over-read I describe here: https://github.com/CJWorkbench/arrow-tools/blob/005fe582b428c1ab6a9ed5f6dc968387d77e9a80/src/arrow-validate.cc#L27. But it feels to me like Arrow's Validate() should be checking this itself.

              People

              • Assignee: Antoine Pitrou (apitrou)
              • Reporter: Adam Hooper (adamhooper)

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 40m