Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-8006

[C++] Unsafe arrow dictionary recovered from parquet

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 0.15.1
    • 0.17.0
    • C++

    Description

      When an arrow dictionary of values=strings and indices=intx is written to parquet and recovered, the indices that correspond to null positions are not written. This causes two problems:

      • when transposing the dictionary, the code encounters indices that are out of bounds with the existing dictionary. This does cause crashes.
      • a potential security risk because it's unclear whether bytes can be read back inadvertently.

      I traced using GDB and found that:

      1. My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the valid bit is unset, that function increments "out" but does not set it. I think it should write a 0. https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396
      2. The recovered data "out" array is written to the dictionary builder using an AppendIndices which moves the memory as a bulk move without checking for nulls. Hence we end-up with the indices buffer holding the "out" from above. https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670 When transpose runs on this (https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406), it may attempt to access memory out of bounds.

      While is would be possible to fix "transpose" and other functions that process dictionary indices (e.g. compare for sorting), it seems safer to initialize to 0. Also that's the default behavior for the arrow dict builder when appending one or more nulls.

      Incidentally the code recovers the dict with indices int32 instead of the original int8 but I guess that this is covered by another activity.

       

      Attachments

        Issue Links

          Activity

            People

              apitrou Antoine Pitrou
              belzilep Pierre Belzile
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m