Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
0.15.1
Description
When an arrow dictionary of values=strings and indices=intx is written to parquet and recovered, the indices that correspond to null positions are not written. This causes two problems:
- when transposing the dictionary, the code encounters indices that are out of bounds with the existing dictionary. This does cause crashes.
- a potential security risk because it's unclear whether bytes can be read back inadvertently.
I traced using GDB and found that:
- My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the valid bit is unset, that function increments "out" but does not set it. I think it should write a 0. https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396
- The recovered data "out" array is written to the dictionary builder using an AppendIndices which moves the memory as a bulk move without checking for nulls. Hence we end-up with the indices buffer holding the "out" from above. https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670 When transpose runs on this (https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406), it may attempt to access memory out of bounds.
While is would be possible to fix "transpose" and other functions that process dictionary indices (e.g. compare for sorting), it seems safer to initialize to 0. Also that's the default behavior for the arrow dict builder when appending one or more nulls.
Incidentally the code recovers the dict with indices int32 instead of the original int8 but I guess that this is covered by another activity.
Attachments
Issue Links
- links to