[ARROW-8006] [C++] Unsafe arrow dictionary recovered from parquet - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.15.1
Fix Version/s: 0.17.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24222

Description

When an Arrow dictionary array with string values and integer (intx) indices is written to Parquet and read back, the indices that correspond to null positions are never written on the read path, so those slots are left unset. This causes two problems:

- When transposing the dictionary, the code encounters indices that are out of bounds for the existing dictionary. This does cause crashes.
- It is a potential security risk, because it is unclear whether bytes can be read back inadvertently.

I traced using GDB and found that:

- My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the valid bit is unset, that function increments "out" but does not set it. I think it should write a 0. https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396
- The recovered "out" array is written to the dictionary builder using AppendIndices, which copies the memory in bulk without checking for nulls. Hence we end up with the indices buffer holding the "out" from above. https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670 When transpose runs on this (https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406), it may attempt to access memory out of bounds.
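To illustrate the suggested behavior, here is a minimal sketch of a spaced decode that writes 0 into every null slot instead of skipping it. This is not Arrow's RleDecoder code: the function name `DecodeSpacedZeroFill`, its parameters, and the use of `std::vector<bool>` as a validity bitmap are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a spaced decode (not Arrow's RleDecoder API).
// `values` holds the densely packed non-null indices, `valid[i]` says whether
// output slot i carries a value, and `out` must have room for valid.size()
// entries.
void DecodeSpacedZeroFill(const int32_t* values, std::size_t num_values,
                          const std::vector<bool>& valid, int32_t* out) {
  std::size_t next = 0;  // position of the next non-null value to emit
  for (std::size_t i = 0; i < valid.size(); ++i, ++out) {
    if (valid[i] && next < num_values) {
      *out = values[next++];
    } else {
      // Null slot: write a defined, in-bounds index rather than only
      // advancing `out` and leaving the slot uninitialized.
      *out = 0;
    }
  }
}
```

With this approach, a later bulk copy of "out" into an indices buffer only ever carries values that were explicitly written, so null slots hold index 0 rather than whatever was previously in memory.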

While it would be possible to fix "transpose" and other functions that process dictionary indices (e.g. compare for sorting), it seems safer to initialize to 0. That is also the default behavior of the Arrow dictionary builder when appending one or more nulls.
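For comparison, the alternative of hardening each consumer could look roughly like the sketch below: a transpose that validates every index before using it as a lookup position. This is an illustrative assumption, not the code in int_util.cc; `TransposeChecked` and its signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bounds-checked transpose (not Arrow's implementation).
// Computes dest[i] = transpose_map[src[i]] and returns false as soon as an
// index falls outside transpose_map, instead of reading out of bounds.
bool TransposeChecked(const std::vector<int32_t>& src,
                      const std::vector<int32_t>& transpose_map,
                      std::vector<int32_t>* dest) {
  dest->clear();
  dest->reserve(src.size());
  for (int32_t idx : src) {
    if (idx < 0 || static_cast<std::size_t>(idx) >= transpose_map.size()) {
      return false;  // e.g. garbage left in a null slot
    }
    dest->push_back(transpose_map[static_cast<std::size_t>(idx)]);
  }
  return true;
}
```

A check like this would have to be repeated in every function that reads dictionary indices, which is why writing 0 at decode time looks like the smaller and safer change.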

Incidentally, the code recovers the dictionary with int32 indices instead of the original int8, but I guess that this is covered by another activity.

Issue Links

links to

GitHub Pull Request #6544

People

Assignee: Antoine Pitrou
Reporter: Pierre Belzile
Votes: 0
Watchers: 4

Dates

Created: 05/Mar/20 03:21
Updated: 11/Jan/23 07:57
Resolved: 06/Mar/20 01:03

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 50m