Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 1.6.0
- Component/s: None
Description
I think it is common practice, when writing table data to a Parquet file, to reuse the same objects across rows; and if a column is a byte[] of fixed length, that byte[] gets reused as well.
If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row groups created by a single task end up with the same max & min binary value, namely the last row's binary content.
The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary references; since I use ByteArrayBackedBinary for the byte[], the actual content of max & min always points to the reused byte[], and therefore to the latest row's content.
Does Parquet declare anywhere that the user must not reuse a byte[] backing a Binary value? If it doesn't, I think this is a bug. It can be reproduced with Spark SQL's RowWriteSupport (see also the sketch below).
The related Spark JIRA ticket: SPARK-6859
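For clarity, here is a minimal sketch of the aliasing problem that does not go through Spark. It assumes parquet-column 1.6.0 (the pre-rename "parquet.*" packages) on the classpath; the class name ReusedBufferStatsDemo and the sample values are made up for illustration. Binary.fromByteArray wraps the given array without copying, so BinaryStatistics ends up holding references into the shared buffer:

    import parquet.column.statistics.BinaryStatistics;
    import parquet.io.api.Binary;

    import java.nio.charset.StandardCharsets;

    // Illustrative demo class; not the reporter's actual reproduction,
    // which used Spark SQL's RowWriteSupport.
    public class ReusedBufferStatsDemo {
        public static void main(String[] args) {
            byte[] reused = new byte[3];                // one buffer shared by every "row"
            BinaryStatistics stats = new BinaryStatistics();

            for (String row : new String[] {"foo", "bar", "baz"}) {
                byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
                System.arraycopy(bytes, 0, reused, 0, bytes.length);
                // fromByteArray wraps the array without copying, so the statistics
                // keep a reference into the shared, mutable buffer.
                stats.updateStats(Binary.fromByteArray(reused));
            }

            // Expected: min = "bar", max = "foo".
            // Observed with the reused buffer: both print "baz", the last row written.
            System.out.println("min = " + stats.getMin().toStringUsingUTF8());
            System.out.println("max = " + stats.getMax().toStringUsingUTF8());
        }
    }

A writer-side workaround is to copy the bytes for each row before wrapping them (for example, Binary.fromByteArray(Arrays.copyOf(reused, reused.length))), so the statistics no longer alias shared memory, at the cost of an extra allocation per value.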
Issue Links
- blocks
  - PARQUET-292 Release Parquet 1.8.0 (Resolved)
- is contained by
  - PARQUET-77 Improvements in ByteBuffer read path (Resolved)
- is duplicated by
  - PARQUET-326 Binary statistics are invalid if buffers are reused (Resolved)
  - PARQUET-258 Binary statistics is not updated correctly if an underlying Binary array is modified in place (Resolved)
- links to