Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
Description
DictionaryValuesWriter passes incoming Binary objects directly to an Object2IntMap to accumulate dictionary values. If the caller reuses the arrays backing those Binary objects, the accumulated values are silently corrupted and still written out without any error.
Because Hadoop reuses objects passed to mappers and reducers, this is easy to trigger. For example, Avro reuses the byte arrays backing its Utf8 objects, which parquet-avro wraps in a Binary object and passes to writeBytes.
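For illustration, here is a minimal, self-contained sketch of the failure mode using plain JDK types rather than the actual parquet classes (ByteBuffer and HashMap stand in for Binary and Object2IntMap; ReusedBufferDemo is an illustrative name): a dictionary keyed on an uncopied view of a caller-owned array is silently corrupted when the caller reuses that array.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    // Stand-in for the dictionary accumulation: ByteBuffer.wrap shares the
    // caller's array, much like wrapping it in a Binary without copying.
    public class ReusedBufferDemo {
        public static void main(String[] args) {
            Map<ByteBuffer, Integer> dictionary = new HashMap<>();

            byte[] reused = "alpha".getBytes(StandardCharsets.UTF_8);
            dictionary.put(ByteBuffer.wrap(reused), 0); // no copy is made

            // The caller (e.g. Avro refilling a Utf8's backing array)
            // overwrites the buffer before the dictionary page is written.
            byte[] next = "bravo".getBytes(StandardCharsets.UTF_8);
            System.arraycopy(next, 0, reused, 0, next.length);

            // The stored key now reads "bravo"; "alpha" is gone, the key's
            // hash bucket is stale, and no error was raised at any point.
            for (ByteBuffer key : dictionary.keySet()) {
                System.out.println(new String(key.array(), StandardCharsets.UTF_8));
            }
        }
    }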
The fix is to make defensive copies of the values passed to the dictionary writer code. I think this only affects the Binary dictionary classes, because Strings are immutable and primitive values such as floats and longs are passed by value.
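A minimal sketch of the defensive-copy approach, again with plain JDK stand-ins (BinaryDictionary and idFor are illustrative names, not the actual DictionaryValuesWriter API): copy the incoming bytes before they become a map key, so later reuse of the caller's array cannot corrupt stored entries.

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative dictionary that owns its keys outright.
    class BinaryDictionary {
        private final Map<ByteBuffer, Integer> ids = new HashMap<>();

        int idFor(byte[] callerOwned, int offset, int length) {
            // Defensive copy: the caller may reuse its array immediately
            // after this call without affecting the stored key.
            byte[] copy = Arrays.copyOfRange(callerOwned, offset, offset + length);
            ByteBuffer key = ByteBuffer.wrap(copy);
            Integer id = ids.get(key);
            if (id == null) {
                id = ids.size();
                ids.put(key, id);
            }
            return id;
        }
    }

Copying before the lookup, as above, costs one allocation per written value; an alternative is to probe the map with an uncopied view and copy only when inserting a new entry, which bounds allocations by the number of distinct values rather than the row count.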
Issue Links
- relates to: PARQUET-326 Binary statistics are invalid if buffers are reused (Resolved)