Apache Arrow / ARROW-18141

[C++] Alignment not enforced; undefined behavior in Parquet writer

Details

    Description

      It is possible to create arrays whose buffers start at unaligned memory addresses (e.g. for int64). This seems to be in line with the Arrow specification, which, as far as I understand, does not require alignment [1].

      However, the C++ standard requires that objects be accessed at suitably aligned addresses, e.g. 8-byte alignment for int64 on typical 64-bit platforms. Creating an unaligned pointer, or accessing data through one, is undefined behavior (UB).
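
      To make the distinction concrete, here is a minimal sketch (not Arrow code; the helper names IsAlignedFor and LoadUnaligned are hypothetical) of how one can test whether an address is suitably aligned for a type, and how a value can be read from a possibly unaligned address without ever forming a typed pointer. It relies on std::memcpy, which has no alignment precondition.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: true if `ptr` satisfies the alignment requirement of T.
// The pointer-to-integer mapping is implementation-defined, but on mainstream
// platforms it yields the numeric address.
template <typename T>
bool IsAlignedFor(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}

// Hypothetical helper: reads a T from a possibly unaligned address.
// No T* is ever formed from `ptr`, so no alignment rule is violated.
template <typename T>
T LoadUnaligned(const void* ptr) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy-based loads require a trivially copyable type");
  assert(ptr != nullptr);
  T value;
  std::memcpy(&value, ptr, sizeof(T));
  return value;
}
```

      For example, given an 8-byte-aligned byte array `storage`, `IsAlignedFor<std::int64_t>(storage + 1)` is false, while `LoadUnaligned<std::int64_t>(storage + 1)` is still well-defined.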

      Typically, this is not an issue in practice on x86, since gcc and other compilers mostly emit instructions that can deal with unaligned data. However, for gcc 6.3.0 (and probably up to and including gcc versions 7.X), this code:

      https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/statistics.cc#L355

      compiles to an aligned move instruction (movdqa) for the expression values[i]. This, in turn, triggers a SIGSEGV if values points into an unaligned buffer. Later compiler versions (in particular gcc 9.X, which is used to build the wheels published on PyPI) emit instructions that can deal with unaligned data (movdqu instead of movdqa).
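
      As an illustration of an alignment-agnostic alternative, the following is a sketch of a min/max scan over int64 values. It is not the code at the linked line; the function name MinMaxUnaligned and its signature are assumptions made for this example. Each element is read with std::memcpy rather than through values[i], so the compiler is not entitled to assume that the buffer is aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <utility>

// Hypothetical helper: returns {min, max} of `length` int64 values stored
// contiguously at `data`, which may be unaligned.
// For length == 0 it returns {INT64_MAX, INT64_MIN}.
std::pair<std::int64_t, std::int64_t> MinMaxUnaligned(const void* data,
                                                      std::size_t length) {
  assert(data != nullptr || length == 0);
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  std::int64_t min = std::numeric_limits<std::int64_t>::max();
  std::int64_t max = std::numeric_limits<std::int64_t>::lowest();
  for (std::size_t i = 0; i < length; ++i) {
    std::int64_t value;
    // Byte-wise copy: valid for any address, aligned or not.
    std::memcpy(&value, bytes + i * sizeof(value), sizeof(value));
    min = std::min(min, value);
    max = std::max(max, value);
  }
  return {min, max};
}
```

      Optimizing compilers usually turn the fixed-size memcpy into a plain load, so the cost relative to values[i] tends to be small, but that would need to be measured for the compilers in question.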

      The Python script "test1.py" reproduces this issue at the Python level; note that it only triggers a SIGSEGV if Arrow is compiled with a compiler that emits movdqa for the code linked above, e.g. gcc 6.3.0.

      In the wild, unaligned buffers are rare but can appear, e.g. when deserializing pandas DataFrames / NumPy arrays using pickle protocol 5, which allows out-of-band byte buffers that are then re-used as Arrow array buffers.

      I think the first line to enter UB territory is this reinterpret_cast:

      https://github.com/apache/arrow/blob/33f2c0ec8e281fc4fe8c03b07ed2d32e343d9b0e/cpp/src/parquet/column_writer.cc#L1592
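
      One conceivable way to guard such a cast, sketched here under assumptions (the helper AlignedView and its scratch-vector parameter are hypothetical and not part of Arrow), is to check the address first and copy the data into aligned storage only when the check fails. This addresses only the alignment aspect discussed in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: returns a pointer that is suitably aligned for T.
// If `data` is already aligned it is returned as-is (no copy); otherwise the
// bytes are copied into `scratch`, whose storage is aligned for T.
template <typename T>
const T* AlignedView(const void* data, std::size_t length,
                     std::vector<T>* scratch) {
  assert(scratch != nullptr);
  if (reinterpret_cast<std::uintptr_t>(data) % alignof(T) == 0) {
    return static_cast<const T*>(data);
  }
  scratch->resize(length);
  if (length > 0) {
    std::memcpy(scratch->data(), data, length * sizeof(T));
  }
  return scratch->data();
}
```

      The copy is paid only in the rare unaligned case; the returned pointer is valid as long as `scratch` is neither modified nor destroyed.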

      [1] https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding merely "recommends" that buffers be aligned, but does not require it.

      Attachments

        1. test1.py
          1 kB
          Jochen Ott

            People

              apitrou Antoine Pitrou
              jott Jochen Ott

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 50m