Parquet / PARQUET-251

Binary column statistics error when reusing byte[] among rows


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.8.0
    • Component/s: parquet-mr
    • Labels: None

    Description

      I think it is common practice, when inserting table data as a Parquet file, to reuse the same row object across rows; if a column is a byte[] of fixed length, the byte[] is reused as well.

      If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.

      The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my byte[], the real content of max & min always points at the reused byte[], and therefore reflects the latest row's content.
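
      A minimal sketch of the aliasing, assuming the 1.6.x parquet.column.statistics.BinaryStatistics and parquet.io.api.Binary APIs (the buffer contents here are illustrative):

          import parquet.column.statistics.BinaryStatistics;
          import parquet.io.api.Binary;

          public class ReusedBufferDemo {
            public static void main(String[] args) {
              byte[] buf = new byte[4];                  // one buffer reused across rows
              BinaryStatistics stats = new BinaryStatistics();

              // Row 1: the buffer holds "zzzz"; the statistics record min/max
              // as Binary references to buf, not as copies of its bytes.
              java.util.Arrays.fill(buf, (byte) 'z');
              stats.updateStats(Binary.fromByteArray(buf));

              // Row 2: the same buffer is overwritten with "aaaa".
              java.util.Arrays.fill(buf, (byte) 'a');
              stats.updateStats(Binary.fromByteArray(buf));

              // Both lines print "aaaa": min and max alias the mutated buffer,
              // so the statistics always reflect the last row written.
              System.out.println("min = " + stats.getMin().toStringUsingUTF8());
              System.out.println("max = " + stats.getMax().toStringUsingUTF8());
            }
          }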

      Does Parquet declare anywhere that the user shouldn't reuse the byte[] backing a Binary? If it doesn't, I think this is a bug; it can be reproduced with Spark SQL's RowWriteSupport.

      The related Spark JIRA ticket: SPARK-6859
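
      Until the statistics copy their min & max bytes, a possible user-side workaround is to hand Parquet a fresh copy of the buffer for each row. A hedged sketch; recordConsumer stands in for whatever write-support object emits the column value:

          // Copy the reused buffer before wrapping it, so the Binary handed to
          // Parquet owns bytes that no later row can mutate.
          byte[] copy = java.util.Arrays.copyOf(buf, buf.length);
          recordConsumer.addBinary(Binary.fromByteArray(copy));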


People

    • Assignee: Ashish Singh
    • Reporter: Yijie Shen
    • Votes: 2
    • Watchers: 15
